Preface 



Welcome to the proceedings of the 8th European Conference on Computer Vi- 
sion! 

Following a very successful ECCV 2002, the response to our call for papers 
was almost equally strong - 555 papers were submitted. We accepted 41 papers 
for oral and 149 papers for poster presentation. 

Several innovations were introduced into the review process. First, the num- 
ber of program committee members was increased to reduce their review load. 
We managed to assign no more than 12 papers to each program committee member. 
Second, we adopted a paper ranking system. Program committee members were 
asked to rank all the papers assigned to them, even those that were reviewed 
by additional reviewers. Third, we allowed authors to respond to the reviews 
consolidated in a discussion involving the area chair and the reviewers. Fourth, 
the reports, the reviews, and the responses were made available to the authors as 
well as to the program committee members. Our aim was to provide the authors 
with maximal feedback and to let the program committee members know how 
authors reacted to their reviews and how their reviews were or were not reflected 
in the final decision. Finally, we reduced the length of reviewed papers from 15 
to 12 pages. 

The preparation of ECCV 2004 went smoothly thanks to the efforts of the or- 
ganizing committee, the area chairs, the program committee, and the reviewers. 
We are indebted to Anders Heyden, Mads Nielsen, and Henrik J. Nielsen for 
passing on ECCV traditions and to Dominique Asselineau from ENST/TSI who 
kindly provided his GestRFIA conference software. We thank Jan-Olof Eklundh 
and Andrew Zisserman for encouraging us to organize ECCV 2004 in Prague. 
Andrew Zisserman also contributed many useful ideas concerning the organiza- 
tion of the review process. Olivier Faugeras represented the ECCV Board and 
helped us with the selection of conference topics. Kyros Kutulakos provided hel- 
pful information about the CVPR 2003 organization. David Vernon helped to 
secure ECVision support. 

This conference would never have happened without the support of the 
Centre for Machine Perception of the Czech Technical University in Prague. 
We would like to thank Radim Sara for his help with the review process and 
the proceedings organization. We thank Daniel Vecerka and Martin Matousek 
who made numerous improvements to the conference software. Petr Pohl helped 
to put the proceedings together. Martina Budosova helped with administrative 
tasks. Hynek Bakstein, Ondřej Chum, Jana Kostkova, Branislav Micusik, Stepan 
Obdrzalek, Jan Sochman, and Vít Zyka helped with the organization. 

Abstract. In the image restoration literature, there are mainly two 
kinds of approach. One is based on processing the wavelet coefficients 
of the image, such as wavelet shrinkage for denoising. The other is based on 
processing the image gradient. In order to get an edge-preserving 
regularization, one usually assumes that the image belongs to the space of 
functions of Bounded Variation (BV). An energy is minimized, composed 
of an observation term and the Total Variation (TV) of the image. 
Recent contributions try to mix both types of method. In this spirit, 
the goal of this paper is to define a unified framework that includes both 
wavelet methods and energy minimization such as TV. In fact, for denoising 
purposes, it has already been shown that wavelet soft-thresholding is 
equivalent to choosing the regularization term as the norm of the Besov 
space $B^1_{1,1}$. In the present work, this equivalence result is extended to 
the deconvolution problem. We propose a general functional to minimize, which 
includes TV minimization, wavelet coefficient regularization, mixed 
(TV+wavelet) regularization or more general terms. Moreover we give 
a projection-based algorithm to compute the solution. The convergence 
of the algorithm is also stated. We show that the decomposition of an 
image over a dictionary of elementary shapes (atoms) is also included in 
the proposed framework. This gives a new algorithm to solve this difficult 
problem, known as Basis Pursuit. We also show numerical results of 
image deconvolution using TV, wavelet, or TV+wavelet regularization 
terms. 



1 Introduction 

1.1 Image Restoration 

Restoring images from blurred and/or noisy data is an important task in image 
processing. In the large literature developed over the last twenty years, most 
approaches are based on energy minimization. Such an energy contains mainly 
two terms: the first term models how the observed data is derived from the 
original data one would like to reconstruct; the second term contains a priori 
information on the regularity of this original data. At this point, two important 
families of criteria emerge. In the first family the regularity criterion is a semi- 
norm that is expressed in a “simple” way in terms of the wavelet coefficients of 
the image (usually a Besov norm). This leads to a restoration process that is 
performed through some processing of the wavelet coefficients, such as a wavelet 
shrinkage (for example see [5] in denoising, [10] in deconvolution, [8] in Radon 
transform inversion). 

In the second family, the regularity criterion is a functional of the gradient of 
the image, so that the resolution of the problem amounts to solving some more 
or less complex PDE. In order to get an edge-preserving regularization, one 
usually assumes that the image belongs to the space of functions of Bounded 
Variation (BV) and the criterion which is minimized is the Total Variation 
(TV) of the image (see [12] for example). 

Recent contributions try to mix both types of method [9,14,6]. In this 
spirit, the goal of this paper is to define a unified framework that includes 
wavelet, TV, or more general semi-norms. In fact, as shown in [2] for 
denoising and compression purposes, wavelet soft-thresholding is equivalent 
to choosing the regularization term as the norm of the Besov space $B^1_{1,1}$. 
In the present work, this equivalence result is extended to the deconvolution 
problem. The proposed framework makes it possible to include TV minimization, 
mixed (TV+wavelet) regularization or more general terms. Moreover we give a 
projection-based algorithm to compute the solution in the more general case. 
The convergence of the algorithm is also stated. 

Image restoration can be considered as the minimization of a functional 
written as 

    $\frac{1}{2\lambda}\|g - Au\|_{X_1}^2 + |u|_Y^s$    (1)

A is a linear operator which can model the degradation during the observation 
of the object u: 

    $A : X \to X_1, \quad u \mapsto g = Au + \eta$    (2)



X is the space describing the objects to be reconstructed and $X_1$ the space of 
observations; $\eta$ is the acquisition noise. Typically $X = X_1 = L^2(\mathbb{R}^2)$, 
or $X = X_1$ is a finite-dimensional space. As in [2], $|u|_Y$ is a norm or a 
semi-norm in a smoothness space Y. A standard example is $Y = W^{1,2}$, $s = 2$, 
defining quadratic regularization as proposed by Tikhonov [15]. Now, if Y is 
the BV space and s = 1, the solution is the one such that Au best approximates 
g (in the sense of the norm $\|\cdot\|_{X_1}$) with minimal Total Variation [12]. 
This general functional also includes wavelet shrinkage denoising/deconvolution 
methods, by considering A = I (where I is the identity operator) or A the 
operator defined by the Point Spread Function of the optics, with Y the Besov 
space $B^1_{1,1}$ and s = 1 [2]. Notice that if A defines 
a decomposition over a dictionary of possible atoms from which the signal u is 
built (for example wavelet packets for textures, curvelets or bandlets for edges, 
and so on), then solving (1) corresponds exactly to the Basis Pursuit DeNoising 
algorithm (BPDN) by Chen, Donoho and Saunders [3]. 

1.2 Problem Statement 

In this paper, we study the minimization of a functional of the form 

    $\frac{1}{2\lambda}\|g - Au\|_{X_1}^2 + J(u)$    (3)

where $J : X \to \mathbb{R} \cup \{+\infty\}$ is a semi-norm on X. For the sake of 
simplicity, we assume throughout the paper that u is a discrete image, that is 
to say $X = X_1 = \mathbb{R}^{N \times N}$, and the symbol $\|\cdot\|$ will denote any 
Hilbertian norm. 

In order to minimize the functional (3), we describe a projection-based 
algorithm which extends the one proposed by Chambolle for the denoising case 
(A = I) with TV regularization [1]. 

The convergence of the algorithm is proved. This gives a new algorithm 
to solve several kinds of image processing problems: image deconvolution with 
TV regularization, with wavelet shrinkage, or with both kinds of regularization; 
and the BPDN problem as described by Donoho in [3]. During the review process 
of the paper, our attention was drawn by S. Mallat to the independent works [7,4] 
which derive, by different approaches, essentially the same iterative algorithm 
as the one described in this paper. In [4], strong convergence of the iterative 
algorithm is shown in infinite dimension. One difference is that our algorithm 
includes TV or mixed (TV+wavelet) regularization, which seems not to be the 
case in [7,4]. 

In section 2, we recall some basic tools in convex analysis. The main contri- 
bution of this paper is detailed in section 3 where the minimization algorithm 
is given for the general functional (3). In section 4, we show that several 
standard methods in image restoration are special cases of the unified energy 
(3), and numerical results are given for deconvolution with TV plus wavelet 
regularization. 



1.3 Notations 



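Let us fix some notation. A discrete image will be denoted by $u_{i,j}$, 
$i, j = 1, \dots, N$. 

In order to define the TV of the discrete image u, we introduce the gradient 
$\nabla : X \to X \times X$ defined by: 

    $(\nabla u)^1_{i,j} = u_{i+1,j} - u_{i,j}$ if $i < N$, and $0$ if $i = N$,
    $(\nabla u)^2_{i,j} = u_{i,j+1} - u_{i,j}$ if $j < N$, and $0$ if $j = N$.

We also introduce a discrete version of the divergence operator defined, by 
analogy with the continuous case, by $\mathrm{div} = -\nabla^*$ where $\nabla^*$ is the 
adjoint of $\nabla$. We have 

    $(\mathrm{div}(p))_{i,j} = \begin{cases} p^1_{i,j} - p^1_{i-1,j} & \text{if } 1 < i < N \\ p^1_{i,j} & \text{if } i = 1 \\ -p^1_{i-1,j} & \text{if } i = N \end{cases} + \begin{cases} p^2_{i,j} - p^2_{i,j-1} & \text{if } 1 < j < N \\ p^2_{i,j} & \text{if } j = 1 \\ -p^2_{i,j-1} & \text{if } j = N \end{cases}$    (4)

The discrete TV, denoted $J_{TV}$, is defined as the $l^1$-norm of the vector 
$\nabla u$ by 

    $J_{TV}(u) = \|\nabla u\|_1 = \sum_{i,j=1}^{N} \sqrt{\left((\nabla u)^1_{i,j}\right)^2 + \left((\nabla u)^2_{i,j}\right)^2}$    (5)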
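The definitions above can be illustrated by a minimal NumPy sketch. The function names `grad`, `div` and `tv`, and the choice of stacking the two gradient components along the first array axis, are assumptions made for this illustration and do not come from the paper. The divergence is written in an accumulation form that is equivalent to the three-case formula (4).

```python
import numpy as np

def grad(u):
    """Forward-difference gradient of an N x N image; result has shape (2, N, N)."""
    g = np.zeros((2,) + u.shape)
    g[0, :-1, :] = u[1:, :] - u[:-1, :]   # first component, zero on the last row (i = N)
    g[1, :, :-1] = u[:, 1:] - u[:, :-1]   # second component, zero on the last column (j = N)
    return g

def div(p):
    """Discrete divergence, div = -grad^*; equivalent to formula (4)."""
    d = np.zeros(p.shape[1:])
    d[:-1, :] += p[0, :-1, :]   # + p^1_{i,j}   for i < N
    d[1:, :] -= p[0, :-1, :]    # - p^1_{i-1,j} for i > 1
    d[:, :-1] += p[1, :, :-1]   # + p^2_{i,j}   for j < N
    d[:, 1:] -= p[1, :, :-1]    # - p^2_{i,j-1} for j > 1
    return d

def tv(u):
    """Discrete total variation (5): sum of the Euclidean norms of the gradient."""
    g = grad(u)
    return np.sqrt(g[0] ** 2 + g[1] ** 2).sum()
```

With these definitions, the adjoint relation $\langle \nabla u, p \rangle = -\langle u, \mathrm{div}(p) \rangle$ can be checked numerically on random arrays `u` and `p`.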



2 Some Tools of Convex Analysis 



We recall in this section some usual tools in convex analysis which are used to 
build our algorithm. We refer the reader to Rockafellar [11] for a more complete 
introduction of convex analysis. 

Definition 1 (Legendre-Fenchel Conjugate). Let $\phi$ be a function 
$X \to \mathbb{R} \cup \{+\infty\}$. We assume that $\phi \not\equiv +\infty$. The conjugate 
function of $\phi$ is defined as $\phi^* : X \to \mathbb{R} \cup \{+\infty\}$ by: 

    $\phi^*(s) = \sup_{x \in X} \{ \langle s, x \rangle - \phi(x) \}$    (6)

$\phi^*$ is convex and lower semi-continuous (lsc). 

Definition 2 (Indicator function, support function). 

Let $K \subset X$ be a non empty closed convex subset of X. The indicator function 
of K, called $\chi_K$, is defined by: 

    $\chi_K(x) = \begin{cases} 0 & \text{if } x \in K \\ +\infty & \text{otherwise} \end{cases}$    (7)

We call support function of K the function denoted $\delta_K$, defined by: 

    $\delta_K(s) = \sup_{x \in K} \langle s, x \rangle$    (8)

The link between $\chi_K$ and $\delta_K$ is given by the following results. 

Theorem 1. Let $K \subset X$ be a non empty closed convex subset of X. Then the 
functions $\chi_K$ and $\delta_K$ are convex, lsc and mutually conjugate. 

Theorem 2. Every support function $\delta_K$ associated to a non empty closed 
convex subset K is convex, one-homogeneous (i.e. $\forall t > 0, \forall x \in X$, 
$\delta_K(tx) = t\,\delta_K(x)$) and lsc. Conversely, each function $\phi \not\equiv +\infty$ 
that is convex, one-homogeneous and lsc is the support function of a closed 
convex set $K_\phi$ defined by: 

    $K_\phi = \{ s \in X, \ \forall x \in X, \ \langle s, x \rangle \le \phi(x) \}$    (9)

For example, in (3), we have supposed that J(u) is a semi-norm. Therefore, 
it is a convex, one-homogeneous and lsc function. So J(u) is the support function 
of a closed convex set $K_J$ defined by (9). If J(u) is the discrete TV semi-norm 
given by (5), then 

    $K_{TV} = \{ \mathrm{div}(p), \ p \in X \times X, \ |p_{i,j}| \le 1 \ \forall i, j \}$    (10)
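A simple finite-dimensional instance of Theorem 2 may help fix ideas. The $l^1$-norm on $\mathbb{R}^n$ is convex, one-homogeneous and lsc, and the set given by (9) is then the $l^\infty$ unit ball $K = \{ s : |s_i| \le 1 \}$. The sketch below, in which the function name `support_linf_ball` is an assumption made for illustration, evaluates the support function of this ball in closed form and recovers the $l^1$-norm.

```python
import numpy as np

def support_linf_ball(s):
    """Support function of K = {x : |x_i| <= 1 for all i}, i.e. sup_{x in K} <s, x>.

    The supremum is attained at x = sign(s), which gives the l1-norm of s.
    """
    x_star = np.sign(s)
    return float(np.dot(s, x_star))
```

Any other point of K gives a smaller or equal inner product with s, which is exactly the inequality $\langle s, x \rangle \le \|s\|_1$ for $|x_i| \le 1$.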

We now introduce the notion of sub-differential of a function, which 
generalizes the differential for convex functions. 

Definition 3 (Sub-differential). Let $\phi$ be a convex function. We define the 
sub-differential $\partial\phi(x)$ of $\phi$ at $x \in X$ by: 

    $s \in \partial\phi(x) \iff \forall x' \in X, \ \phi(x') \ge \phi(x) + \langle s, x' - x \rangle$    (11)

Note that if x is such that $\phi(x) < \infty$ and if $\phi$ is differentiable at x, then: 

    $\partial\phi(x) = \{ \nabla\phi(x) \}$    (12)

3 Algorithm and Convergence Result 

This section is devoted to the main contribution of this paper, namely the de- 
scription and the convergence of our algorithm for numerically solving the min- 
imization problem (3). Before doing that, in order to justify our algorithm, we 
need some preliminary results. 



3.1 Preliminary Results 

Theorem 3. Let $B : X \to X$ be a linear, self-adjoint and positive operator 
satisfying $\|B\| < 1$. Then 

    $\forall u \in X, \quad \langle Bu, u \rangle = \min_{w \in X} \left[ \|u - w\|^2 + \langle Cw, w \rangle \right]$    (13)

where $C = B(I - B)^{-1}$. Moreover, the minimum is reached at a unique point 
$w_u$ which verifies: 

    $w_u = (I + C)^{-1}(u) = (I - B)(u)$    (14)
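Theorem 3 can be checked numerically in finite dimension. The following sketch builds a random symmetric positive matrix B with $\|B\| = 0.9$ and compares both sides of (13) at the point given by (14). The helper name `check_theorem3`, the dimension and the scaling factor 0.9 are arbitrary choices made for this illustration.

```python
import numpy as np

def check_theorem3(n=6, seed=0):
    """Return both sides of (13), the two expressions of w_u in (14), and the objective."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, n))
    S = M @ M.T                                  # symmetric positive semi-definite
    B = 0.9 * S / np.linalg.norm(S, 2)           # rescaled so that ||B|| = 0.9 < 1
    I = np.eye(n)
    C = B @ np.linalg.inv(I - B)                 # C = B (I - B)^{-1}
    u = rng.standard_normal(n)

    def objective(w):
        # ||u - w||^2 + <Cw, w>, the quantity minimized over w in (13)
        return np.sum((u - w) ** 2) + w @ C @ w

    w_u = (I - B) @ u                            # minimizer claimed in (14)
    w_alt = np.linalg.solve(I + C, u)            # (I + C)^{-1} u, also from (14)
    return u @ B @ u, objective(w_u), w_u, w_alt, objective
```

The first two returned values agree up to rounding, and perturbing `w_u` in any direction can only increase the objective, since the objective is a convex quadratic in w.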



Let us recall that the functional we want to minimize is given by 

    $\frac{1}{2\lambda}\|g - Au\|^2 + J(u)$

Let $\mu > 0$ be such that 

    $\mu \|A^*A\| < 1$    (15)

and 

    $B = \mu A^*A$    (16)

B is a self-adjoint positive operator. From hypothesis (15), $\mu$ is such that 
$\|B\| < 1$. Thanks to Theorem 3, we will be able to write the data term of (3), 
$\|g - Au\|^2$, as the result of a new minimization problem, w.r.t. an auxiliary 
variable w. We have 

    $\|Au\|^2 = \langle A^*Au, u \rangle$    (17)
    $= \frac{1}{\mu} \langle Bu, u \rangle$    (18)
    $= \frac{1}{\mu} \min_{w \in X} \left( \|u - w\|^2 + \langle Cw, w \rangle \right)$    (19)

with $C = B(I - B)^{-1}$. Therefore 

    $\frac{1}{2\lambda}\|g - Au\|^2 = \min_{w \in X} H(u, w)$    (20)

where H is the convex differentiable function defined by: 

    $H(u, w) = \frac{1}{2\lambda\mu} \left( \|u - w\|^2 + \langle Cw, w \rangle \right) + \frac{1}{2\lambda} \left( \|g\|^2 - 2\langle Au, g \rangle \right)$    (21)

Let us denote 

    $\Psi_1 = I - B = I - \mu A^*A$    (22)


(22) 



From relation (14), w minimizes H(u, .) if and only if $w = \Psi_1 u$. Let us now 
consider the function F defined by: 

    $F(u, w) = H(u, w) + J(u)$    (23)

F is a convex continuous function, and we deduce from the previous 
preliminary results the following proposition: 

Proposition 1. w minimizes F(u, .) defined in (21) and (23) if and only if 
$w = \Psi_1 u$ where $\Psi_1 = I - \mu A^*A$, and we have: 

    $\forall w \ne \Psi_1 u, \quad F(u, \Psi_1 u) < F(u, w)$    (24)

Let us now show that computing the global minimizer of F reduces to 
minimizing each of its partial functions F(., w) and F(u, .). This is a 
non-trivial result even in the case of a strictly convex function (consider for 
instance $f(x, y) = (x^2 + y^2)/2 + |x - y|$ at $x = y = 1/2$, where both partial 
functions are minimal although the global minimizer is (0, 0)). 

Proposition 2. (u, w) minimizes F if and only if: 

    $\begin{cases} u \text{ minimizes } F(\cdot, w) \\ w \text{ minimizes } F(u, \cdot) \end{cases}$    (25)

Proof. Since F is the sum of two convex continuous functions, 

    $\partial F(u, w) = \partial H(u, w) + \partial J(u, w)$    (26)

As H is differentiable and J does not depend on w, we deduce: 

    $\partial F(u, w) = (\nabla_u H(u, w), \nabla_w H(u, w)) + \partial J(u) \times \{0\}$    (27)

So 

    $0 \in \partial F(u, w) \iff \begin{cases} 0 \in \nabla_u H(u, w) + \partial J(u) \\ 0 = \nabla_w H(u, w) \end{cases}$    (28)

which is exactly what we want to show. 



The last result we need in order to derive our algorithm is the following: 

Proposition 3. Let us denote by $\Psi_2 : X \to X$ the mapping defined by: 

    $\Psi_2(w) = (I - \Pi_{\lambda\mu K_J})(w + \mu A^*g)$    (29)

Here $\Pi_{\lambda\mu K_J}(w)$ stands for the orthogonal projection of w on the convex 
set $\lambda\mu K_J$, where $K_J$ is the convex set associated to J(u) (see (9)). Then u 
minimizes F(., w) if and only if $u = \Psi_2(w)$. 

The expression (29) is found by computing the dual problem of $\min_u F(u, w)$, 
for fixed w (see [1]). 



3.2 The Algorithm for the Minimization of the Unified Functional 

We are now able to describe the algorithm we propose to minimize the unified 
functional (3). Based on the results given in Propositions 1, 2 and 3, we propose 
the following iterative algorithm to minimize (3): 

    $w_n = (I - \mu A^*A)(u_n)$    (30)
    $u_{n+1} = (I - \Pi_{\lambda\mu K_J})(w_n + \mu A^*g)$    (31)

By a change of notation, using $v_n = w_n + \mu A^*g$, this becomes: 

    $v_n = u_n + \mu A^*(g - Au_n)$    (32)
    $u_{n+1} = (I - \Pi_{\lambda\mu K_J})(v_n)$    (33)

In practice, we will use the algorithm (32)-(33) rather than the form 
(30)-(31), and we use the numerical algorithm (35)-(36) described in section 3.3 
to compute $\Pi_{\lambda\mu K_J}$. 

The first equation of this algorithm is a fixed-step descent algorithm, 
considering only the minimization of the data term $\|g - Au\|^2$. The step is 
fixed by the parameter $\mu$. The second equation corresponds to a denoising 
step over the 



J. Bect et al. 



current estimate $v_n$. Remark that the parameter considered in the denoising 
step is $\lambda\mu$ rather than $\lambda$, as one might expect from the functional (3). 
We can also observe that in the case where $A^*A$ is invertible, (32)-(33) 
corresponds to a contraction with ratio $1 - \mu\lambda_0$, where $\lambda_0$ is the smallest 
eigenvalue of $A^*A$. In the case where $A^*A$ is not invertible, the transformation 
(32)-(33) is 1-Lipschitz. In either situation the following theorem holds. 

Theorem 4 (Convergence of the algorithm). Let $\mu > 0$ and assume 

    $\mu \|A^*A\| < 1$    (34)

Then the algorithm (32)-(33) converges to a global minimizer of (3). 

Before ending this section, let us remark that a minimizer of the functional 
(3) always exists in the discrete setting. However the difficult point for the 
convergence proof comes from the fact that the minimizer is not necessarily 
unique. 
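As an illustration, here is a minimal sketch of iteration (32)-(33) in the simplest setting, where J is the $l^1$-norm of u itself. In that case $K_J$ is the $l^\infty$ unit ball, the projection onto $\lambda\mu K_J$ is a componentwise clipping, and $I - \Pi_{\lambda\mu K_J}$ is a soft-thresholding with threshold $\lambda\mu$. The operator A is assumed to be given as a dense NumPy matrix, and the names `soft_threshold`, `energy` and `minimize_l1`, the factor 0.99 and the iteration count are choices made for this example only.

```python
import numpy as np

def soft_threshold(v, t):
    """(I - Pi_{tK})(v) for K the l-infinity unit ball: shrink each entry towards 0 by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def energy(A, g, u, lam):
    """Functional (3) with J(u) = ||u||_1."""
    return np.sum((g - A @ u) ** 2) / (2.0 * lam) + np.sum(np.abs(u))

def minimize_l1(A, g, lam, n_iter=500):
    """Iterate (32)-(33) starting from u = 0, with a step mu satisfying (34)."""
    mu = 0.99 / np.linalg.norm(A.T @ A, 2)       # mu * ||A^T A|| = 0.99 < 1
    u = np.zeros(A.shape[1])
    for _ in range(n_iter):
        v = u + mu * (A.T @ (g - A @ u))         # descent step on the data term, (32)
        u = soft_threshold(v, lam * mu)          # denoising step with parameter lam * mu, (33)
    return u
```

For A = I the iteration converges to the soft-thresholding of g with threshold $\lambda$, which is the known closed-form minimizer in the pure denoising case.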

3.3 Projection Algorithm of Chambolle 

We give in this section the numerical algorithm to compute a projection 
$\Pi_{\lambda K_J}$, in the case of a regularizing term expressed as 
$J(u) = \|Qu\|_1 = \sum_\theta |(Qu)_\theta|$, where Q is a linear operator 
$Q : X \to \Theta$ and $\Theta$ is a product space (see section 4.3 for more details on 
the notation). The projection onto the closed convex set $\lambda K_J$, where $K_J$ 
is associated to $\|Qu\|_1$, can be numerically computed by a fixed point method, 
based on results in [1]. We build recursively a sequence in $\Theta$ of vectors 
$p_n = (p_{n,\theta})_\theta$ in the following way: we choose 
$p_0 \in B_\Theta = \{ p \in \Theta : |p_\theta| \le 1 \ \forall\theta \}$ and for each $n \ge 0$ we let 

    $q_n = Q(Q^*p_n - g/\lambda)$    (35)

and for each $\theta$ 

    $p_{n+1,\theta} = \frac{p_{n,\theta} - \tau q_{n,\theta}}{1 + \tau |q_{n,\theta}|}$    (36)

We have a sufficient condition ensuring the convergence of the algorithm: 

Theorem 5. Assume that the parameter $\tau$ in (36) verifies $\tau < 1/\kappa^2$ where 
$\kappa = \|Q^*\|$. Then for every initial condition $p_0 \in B_\Theta$, the algorithm 
(35)-(36) is such that 

    $\lambda Q^* p_n \longrightarrow \lambda Q^* p = \Pi_{\lambda K_J}(g)$    (37)
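The fixed point iteration (35)-(36) can be sketched as follows for the TV case, taking $Q = \nabla$ and $Q^* = -\mathrm{div}$ with the discrete operators of section 1.3. This is only an illustrative sketch: the function names, the default step `tau = 0.12` (chosen below 1/8, the bound classically used for this discrete divergence) and the fixed iteration count are assumptions, not values taken from the paper.

```python
import numpy as np

def grad(u):
    """Forward-difference gradient; result has shape (2, N, N)."""
    g = np.zeros((2,) + u.shape)
    g[0, :-1, :] = u[1:, :] - u[:-1, :]
    g[1, :, :-1] = u[:, 1:] - u[:, :-1]
    return g

def div(p):
    """Discrete divergence, the negative adjoint of grad."""
    d = np.zeros(p.shape[1:])
    d[:-1, :] += p[0, :-1, :]
    d[1:, :] -= p[0, :-1, :]
    d[:, :-1] += p[1, :, :-1]
    d[:, 1:] -= p[1, :, :-1]
    return d

def tv(u):
    """Discrete total variation of u."""
    g = grad(u)
    return np.sqrt(g[0] ** 2 + g[1] ** 2).sum()

def project_tv_ball(g, lam, tau=0.12, n_iter=200):
    """Approximate the projection of g onto lam * K_TV; also return the dual variable p."""
    p = np.zeros((2,) + g.shape)                     # p_0 = 0 lies in B_Theta
    for _ in range(n_iter):
        q = grad(-div(p) - g / lam)                  # (35) with Q = grad, Q* = -div
        norm_q = np.sqrt(q[0] ** 2 + q[1] ** 2)      # Euclidean norm of q at each pixel
        p = (p - tau * q) / (1.0 + tau * norm_q)     # (36), keeps |p| <= 1 at each pixel
    return -lam * div(p), p                          # lam * Q* p approximates the projection
```

In the denoising case A = I, the restored image is then `g - project_tv_ball(g, lam)[0]`; the same routine, called with parameter $\lambda\mu$, supplies the projection needed in step (33).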



4 Applications 

For TV regularization, one just needs to apply the algorithm (32)-(33) with 
$K_J = K_{TV}$, so that the projection algorithm is given by (35)-(36) with 
$\Theta = X \times X$, $Q = \nabla$, $Q^* = -\mathrm{div}$ (as described in [1]). Let us now look 
at regularization in the wavelet domain. 

4.1 Wavelet Shrinkage 

Let us consider the case where J(u) is the norm in the Besov space $B^1_{1,1}$. In 
[2], it is shown that this norm is equivalent to the norm of the sequence of 
wavelet coefficients 

    $\|f\|_{B^1_{1,1}} \asymp \sum_{j,k,\psi} |c_{j,k,\psi}|$    (38)

Here $\psi \in \Psi$, where the set $\Psi$ defines bi-dimensional wavelets from a 
one-dimensional wavelet and a one-dimensional scaling function as usual. The 
set of functions $\{\psi_{j,k}\}$ forms an orthogonal basis of $L^2(\mathbb{R}^2)$. 
Then, for $f \in L^2(\mathbb{R}^2)$, we have 

    $f = \sum_{j,k,\psi} c_{j,k,\psi}\, \psi_{j,k}$    (39)

For the sake of simplicity, the range of the indices is omitted: we work with 
discrete functions with bounded domain of definition. Assume that our purpose 
is image deconvolution, that is to say A is a convolution operator representing 
the transfer function of the optics. Then if we want to deconvolve the observed 
image g with a wavelet regularization term, we have to minimize an energy of 
the form 

    $\frac{1}{2\lambda}\|g - Au\|^2 + \|u\|_{B^1_{1,1}}$    (40)

We know (see for example [2]) that if A = I, minimizing (40) is equivalent 
to a soft-thresholding algorithm. Let us now see what happens with algorithm 
(32)-(33) and a general operator A. The convex set $K_1$ associated to the norm 
in $B^1_{1,1}$ is defined by (see (9)) 

    $K_1 = \{ s \in X \ / \ \forall x \in X, \ \langle s, x \rangle \le \|x\|_{B^1_{1,1}} \}$    (41)

We easily deduce that 

    $K_1 = \{ s \in X \ / \ \forall j, k, \psi, \ |d_{j,k,\psi}| \le 1 \}$    (42)

where $d_{j,k,\psi}$ are the wavelet coefficients of s. 

Therefore equation (33) is a denoising step by soft-thresholding with 
threshold $\lambda\mu$. The algorithm thus alternates a gradient descent step on the 
data term alone (the deconvolution) with a denoising step by soft-thresholding. 
This algorithm is very easy to implement. 

4.2 Basis Pursuit DeNoising (BPDN) 

Representing a signal with a few large coefficients over a dictionary, and many 
vanishing coefficients, yields a representation of the image that improves the 
performance of shrinkage methods or other restoration methods. The problem is 
to decompose a signal over a possibly large dictionary rather than over one 
orthogonal basis. The dictionary should contain all possible atoms which can be 
used to represent any image. For example we can use in the dictionary DCT, DST, 
biorthogonal wavelets, wavelet packets, curvelets, and so on. Searching for this 
representation is an ill-posed problem, since such a decomposition is not unique. 
In the Basis Pursuit DeNoising algorithm (BPDN) [3] the authors propose the 
following regularizing functional 

    $\min_\alpha \ \frac{1}{2\lambda}\|g - \Phi\alpha\|^2 + \|\alpha\|_1$    (43)

The function g is the signal to be decomposed, $\alpha$ the unknown coefficients 
and $\Phi$ the operator $\Phi : \alpha \mapsto \sum_i \alpha_i \phi_i$, where the $\phi_i$ are the 
elements of the dictionary. 

The minimization (43) can be performed by using algorithm (32)-(33). Since the 
regularization is an $l^1$-norm, the step (33) is simply a soft-thresholding. 

In [3], the IP (Interior Point) algorithm is proposed to solve (43). This 
algorithm is slow. A faster algorithm called BCR (Block Coordinate Relaxation) 
has been proposed in [13]. Like algorithm (32)-(33), BCR is based on a 
soft-thresholding of the coefficients. In BCR, it is assumed that the dictionary 
is composed of a union of orthogonal bases. Our algorithm is more general since 
it can be applied with any dictionary. 

We have compared these three algorithms on some 1D signals. The IP algorithm 
is available on the web (http://www-stat.stanford.edu/atomizer). We have chosen 
the same bases (wavelet transform, DCT and DST) for the three algorithms. On 
a 1D signal of 4096 samples, it appears that the convergence is much faster for 
algorithm (32)-(33) than for the IP algorithm, and a little slower than for the 
BCR one. We may lose in time what we gain in generality. Of course these are 
very few results and many more experiments must be conducted for the comparison. 

4.3 $l^1$-Regularization 

In this section, we show that our algorithm can be applied to a general class of 
semi-norms J(u) which is relevant in real problems. We will consider what we call 
$l^1$-regularization, which consists in the minimization of the following functional 

    $\frac{1}{2\lambda}\|g - Au\|^2 + \|Qu\|_1$    (44)

Q is a linear mapping $Q : X \to \Theta$, where $\Theta$ is the product space defined as: 

    $\Theta = \prod_{1 \le \theta \le r} \mathbb{R}^{n_\theta}$    (45)

endowed with the norm 

    $p \in \Theta \mapsto \|p\|_1 = \sum_{1 \le \theta \le r} |p_\theta|$    (46)

where $|\cdot|$ is the Euclidean norm on $\mathbb{R}^{n_\theta}$ and 
$p = (p_\theta)_{1 \le \theta \le r}$, $p_\theta \in \mathbb{R}^{n_\theta}$. 

$\|Qu\|_1$ is a semi-norm over X and is a norm when Q is injective. 

For example, if Q is a wavelet transform, the $(Qu)_\theta$ are scalar coefficients 
and then 

    $\|Qu\|_1 = \sum_\theta |(Qu)_\theta|$    (47)

is a norm if the sum runs over all coefficients of the wavelet transform, or a 
semi-norm otherwise. 

In the TV case we have 

    $\|Qu\|_1 = \|\nabla u\|_1$    (48)

This general framework also includes a regularization composed of the sum of a 
TV term and a wavelet term. For such a regularization, we will set 
$\Theta = (X \times X) \times X$, and $Q : X \to \Theta$ is defined as: 

    $Q : X \to (X \times X) \times X$    (49)
    $u \mapsto (\gamma \nabla u, \ (1 - \gamma) W u)$    (50)

where W stands for an orthonormal wavelet transform. We use the norm $\|\cdot\|_1$ 
such that: $\forall (p^1, p^2) \in X^2$, $\forall w \in X$, 

    $\|(p, w)\|_1 = \|p\|_1 + \|w\|_1$    (51)
    $= \sum_{i,j} \sqrt{(p^1_{i,j})^2 + (p^2_{i,j})^2} + \sum_{i,j} |w_{i,j}|$    (52)

and the global regularization functional is 

    $J_\gamma(u) = \gamma \|\nabla u\|_1 + (1 - \gamma) \|Wu\|_1 = \|Qu\|_1$    (53)

Note that (53) defines a family of functionals $J_\gamma$, which goes continuously 
from the $l^1$-norm on the wavelet coefficients to the Total Variation as $\gamma$ 
goes from 0 to 1. We show restoration results for deconvolution by using this 
functional for three values of $\gamma$. The Lena image has been blurred by a PSF 
(Point Spread Function) corresponding to a synthetic aperture optical system, 
with vanishing coefficients in the medium frequencies as well as in the high 
frequencies. Gaussian white noise has been added with standard deviation 
$\sigma = 0.05$ (for u values in [0, 1]). This deconvolution problem is very 
difficult because of the vanishing medium frequencies of the degradation and 
the large amount of noise. For $\gamma = 0$ and $\gamma = 1$ we recover the specific 
drawbacks of wavelet and TV restoration respectively: blur and bad edges for 
the wavelets, loss of textures for TV. The value $\gamma = 0.2$ has been chosen by 
hand and gives a good compromise. The choice of this parameter is an open 
problem. The regularizing parameter $\lambda$ is estimated following the ideas 
of [1]. 



[Figure 1 panels: original image; blurred image; observations (blurred image + 
noise, $\sigma = 0.05$); deconvolution results for the three values of $\gamma$ 
(the middle panel is $\gamma = 0.2$)] 

Fig. 1. Results of the algorithm on a deconvolution problem 

5 Conclusion 

We have presented a general functional unifying several approaches to image 
restoration. A convergent and easy-to-implement algorithm has been proposed 
for the minimization of this functional. For a thorough evaluation of our 
algorithm in terms of quality and speed, many more experiments will be 
conducted in each specific application, such as deconvolution or BPDN. 

Abstract. A novel generalization of linear scale space is presented. The 
generalization allows for a sparse approximation of the function at a 
certain scale. 

To start with, we first consider the Tikhonov regularization viewpoint on 
scale space theory [15]. The sparsification is then obtained using ideas 
from support vector machines [22] and based on the link between sparse 
approximation and support vector regression as described in [4] and [19]. 
In regularization theory, an ill-posed problem is solved by searching for 
a solution having a certain differentiability while in some precise sense 
the final solution is close to the initial signal. To obtain scale space, a 
quadratic loss function is used to measure the closeness of the initial 
function to its scale $\sigma$ image. 

We propose to alter this loss function thus obtaining our generalization 
of linear scale space. Comparable to the linear $\epsilon$-insensitive loss 
function introduced in support vector regression [22], we use a quadratic 
$\epsilon$-insensitive loss function instead of the original quadratic measure. The 
$\epsilon$-insensitive loss allows errors in the approximating function without 
actual increase in loss. It penalizes errors only when they become larger 
than the a priori specified constant $\epsilon$. The quadratic form is mainly 
maintained for consistency with linear scale space. 

Although the main concern of the article is the theoretical connection 
between the foregoing theories, the proposed approach is tested and ex- 
emplified in a small experiment on a single image. 



1 Introduction 

There are many extensions, variations, perturbations, and generalizations of 
linear scale space [10,24], e.g. anisotropic, curvature, morphological, $\alpha$, 
pseudo, Poisson, i, torsion, geometry driven and edge preserving scale spaces 
[1,2,3,7,9,13,14,16,17,23]. All these approaches serve one or more purposes in 
image processing, general signal processing or computer vision, as a whole 
covering a large area of applications. This paper presents another interesting 
possibility for generalization of linear scale space that has not been explored 
up to now: generalizations that allow for a sparse scale space representation 
of the function under consideration. Being sparse means that the approximation 
can be obtained using only a small collection of building block or basis 
functions. This sparseness is enforced by imposing certain constraints on the 
solution. 

The proposed technique exploits links between scale space and support vector 
machines. The sparsified representation thus obtained is called the support blob 
machine (SBM), as blobs are the main building blocks used in the representation. 
The principal observation that leads to the notion of SBMs is that both support 
vector regression (SVR) [22,19] and scale space can be related to Tikhonov’s 
regularization theory [21] (see [4], [15], and [20]). For more on the connection 
between scale spaces and regularization see [18]. 

Regularization is a technique typically used for solving ill-posed problems in 
a principled way. While scale space offers a solution to the problem of defining 
derivatives of a multidimensional (digital) signal in a well-posed way, the goal in 
regression is mainly to recover a function from a finite, possibly noisy, sampling 
of this function, which is also clearly ill-posed. Support vector regression (SVR) 
not only offers a well-posed solution to the regression problem but has the addi- 
tional advantage that it can be used to give a sparse solution to the problem at 
hand. The sparse representation is acquired by allowing the regressed function 
to deviate from the initial data without directly penalizing such deviations. This 
sparseness behavior is accommodated through the so-called $\epsilon$-insensitive loss. 
The resulting behavior is clearly different from, for example, standard linear 
regression in which, using a quadratic loss function, even the slightest deviation 
from the given data is penalized immediately. 
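The difference between these loss functions can be made concrete with a small sketch. The linear $\epsilon$-insensitive loss below is the usual one from support vector regression. The quadratic variant is written here in one common form, the square of the linear one, which is an assumption of this illustration and may differ in detail from the definition used later in the paper. The function names are likewise illustrative.

```python
import numpy as np

def quadratic_loss(residual):
    """Standard quadratic loss: any nonzero residual is penalized."""
    return np.asarray(residual, dtype=float) ** 2

def linear_eps_insensitive(residual, eps):
    """Linear eps-insensitive loss: zero inside the tube |residual| <= eps."""
    return np.maximum(np.abs(residual) - eps, 0.0)

def quadratic_eps_insensitive(residual, eps):
    """Assumed quadratic eps-insensitive loss: squared excess of |residual| over eps."""
    return np.maximum(np.abs(residual) - eps, 0.0) ** 2
```

With `eps = 0.5`, a residual of 0.3 costs nothing under either insensitive loss but is penalized by the plain quadratic loss, which is the mechanism that permits sparse solutions.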

This article focuses on the relationships mentioned above, deriving the SBMs, 
providing a computational scheme to determine these scale spaces, and finally 
exemplifying the scale spaces obtained. However, before doing so, we mention 
several reasons why the kind of scale spaces proposed are of interest. 

First of all, the approach may lead to improved image feature detectors. 
SVR, like support vector classifiers, has proven to be very successful in many 
applications which is partly due to the relation with robust statistics [8]. But not 
only may certain forms of sparsified scale space lead to more robust detection of 
edges, blobs and other visual cues, another advantage is that one could build on 
the structural risk minimization framework as proposed by Vapnik [22] and con- 
sequently one may be able to theoretically underpin the practical performance 
of these detectors in real-world applications. 

Another useful application is in the area of feature-based image analysis and 
the related research into metamery classes and image reconstruction [6,11,12]. 
E.g., in [12], certain greedy methods are discussed for the selection of points 
of interest in an image, i.e., based on a certain form of reconstruction energy 
(that can be calculated for every point individually), the representing points are 
chosen starting with the one with the highest energy and going down gradually. 
These points, together with their associated receptive field weighting function 
are then taken as image representation. A greedy approach to the point selection 
problem was considered appropriate by the authors, because features must be 
detected and represented individually in early vision. However, they also note 
that this approach is not necessarily optimal for image representation purposes 
as the reconstruction information mutually conveyed by two or more points is 
not taken into account, leading to only a suboptimal solution for the image 
reconstruction task. The SBM does take the global information into account 
and can therefore obtain an improved selection of points of interest, which all in 
all could prove useful for, e.g., image compression. 

Finally, a little more speculative, the data reduction that is achieved by the 
sparse representation of the data may facilitate the use of otherwise prohibitively 
computer intensive techniques in computer vision or image analysis. As an ex- 
ample one could think of the registration of two large 3D data volumes based 
on their sparse representation instead of all of the voxels. 

The remainder of the paper is organized as follows. Subsection 2.1 gives the
regularization formulation of scale space, after which Subsection 2.2 discusses a
more general formulation of regularized regression taken from the SVR literature.
Subsection 2.3 links the aforementioned techniques. The quadratic $\varepsilon$-insensitive
loss function, on which our principal generalization is based, is then introduced
in Subsection 2.4. Section 3 describes some experiments to exemplify the novel
scale space. Section 4 contains the discussion and concludes the article.



2 The Sparsification of Scale Space 



2.1 Scale Space Regularization Formulation 

This subsection recapitulates one of the principal contributions of [15], in which
scale space is related to a specific instance of Tikhonov regularization. The authors
consider the general regularization formulation given by Tikhonov [21]:
The regularized function $f$ associated to the function $g$ on $\mathbb{R}^n$ minimizes the
functional $\mathcal{E}$ defined as

$$\mathcal{E}[h] := \frac{1}{2} \int \left( \big(h(x) - g(x)\big)^2 + \sum_{J} \lambda_J \left( \frac{\partial^{|J|} h(x)}{\partial x^J} \right)^2 \right) dx . \tag{1}$$



All $\lambda_J$ are nonnegative and $J$ is an $n$-index used for denoting derivatives of order
$|J|$. The first term of the right hand side penalizes any deviation of the function
$h$ from the given function (the data) $g$. The second part does not involve $g$ and
is the regularization term on $h$, which, in a certain way, controls the smoothness.

It can be shown in this setting that the solution to the problem, $f$, can be
obtained by a linear convolution of $g$. Moreover, the authors prove that $f$ equals
$g * G_t$, where the latter is the Gaussian kernel

$$G_t(x) = \frac{1}{(4\pi t)^{n/2}} \, e^{-\frac{\|x\|^2}{4t}} \tag{2}$$

if and only if

$$\lambda_J = \frac{t^{|J|}}{|J|!} = \frac{t^j}{j!} \tag{3}$$

for all $J$ with $|J| = j$, thus relating scale space at scale $t$ to a specific form of Tikhonov
regularization.
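The relation between (1), (2) and (3) can be checked numerically for $n = 1$. The sketch below is only an illustration and rests on two assumptions that are not spelled out in the text: the minimizer of (1) is taken to act as the Fourier-domain filter $1 / (1 + \sum_j \lambda_j \omega^{2j})$, which follows from the Euler-Lagrange equation of (1), and the Fourier transform of $G_t$ is taken to be $e^{-t\omega^2}$. The function names are hypothetical.

```python
import math
import numpy as np

def tikhonov_filter(omega, t, order):
    """Response 1 / (1 + sum_{j=1}^{order} lambda_j * omega^(2j)) with lambda_j = t^j / j!."""
    denom = np.ones_like(omega, dtype=float)
    for j in range(1, order + 1):
        denom += (t ** j / math.factorial(j)) * omega ** (2 * j)
    return 1.0 / denom

def gaussian_filter_response(omega, t):
    """Assumed Fourier response of the Gaussian kernel G_t: exp(-t * omega^2)."""
    return np.exp(-t * omega ** 2)

omega = np.linspace(-3.0, 3.0, 61)
t = 0.5
# With enough terms, the Tikhonov filter approaches the Gaussian response,
# because 1 + sum_j (t * omega^2)^j / j! is the power series of exp(t * omega^2).
max_gap = np.max(np.abs(tikhonov_filter(omega, t, order=40)
                        - gaussian_filter_response(omega, t)))
```

Truncating the sum after the first term gives a first-order Tikhonov filter instead, which visibly deviates from the Gaussian response.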

2.2 Regression and Regularization 

The regression problem is generally formulated in terms of a discrete data set
of $\ell$ (noisy) samples $g(x_i)$ from the function $g$, and the goal is to recover this
underlying function $g$ merely based on these $\ell$ samples [4,19,22]. To this end,
an underlying functional form of $g$ based on a set of linearly independent basis
functions $\varphi_i$ is assumed, i.e., $g$ can be represented as

$$g(x) = \sum_{i=1}^{\infty} c_i \varphi_i(x) + c_0 . \tag{4}$$

The constant term $c_0$ and the parameters $c_i$ have to be estimated from the
data. This is clearly an ill-posed task since the problem as such has an infinite
number of solutions. Again, regularization can be used to turn it into a well-posed
problem by imposing smoothness constraints on the final solution. The
regularized solution $f$ minimizes the functional $\mathcal{R}$

$$\mathcal{R}[h] := C \sum_{i=1}^{\ell} L\big(g(x_i) - h(x_i)\big) + \frac{1}{2} \Lambda[h] , \tag{5}$$

where $L$ is the loss function penalizing deviation of $f$ from the measurement data
$g$, $\Lambda$ is a general constraint that enforces smoothness of the optimal solution $f$,
and $C$ is a positive constant that controls the tradeoff between the two previous
terms.

An important result is the following (see [4]). If the functional $\Lambda$ has the form
$\Lambda[h] = \sum_{i=1}^{\infty} c_i^2 / \lambda_i$, where all $\lambda_i$ are positive and $\{\lambda_i\}$ is a decreasing sequence,
then the solution $f$ to the regularization problem (5) takes on the form

$$f(x) = \sum_{i=1}^{\ell} a_i K(x, x_i) + c_0 , \tag{6}$$

with the kernel function $K$ being defined as

$$K(x, x_i) = \sum_{j=1}^{\infty} \lambda_j \varphi_j(x) \varphi_j(x_i) . \tag{7}$$

A large class of regularizations can be defined based on the foregoing class of
smoothing functionals, and in the remainder of the article only this kind of
smoothness constraint is considered.

Note that if the cost function $L$ is quadratic, the unknown parameters in
Equation (6) can be determined by solving a linear system comparable to standard
regression. When the cost function is not quadratic, the $a_i$ cannot be
readily obtained by solving a linear system and one has to resort to different
optimization methods. For particular loss functions, such as the main one considered in
this article, $\mathcal{R}[h]$ can be optimized using quadratic programming (QP), allowing
the optimization to be done in a fairly straightforward way (see Subsection 2.4).
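For the quadratic loss, the linear system can be made concrete with a small sketch. It is an illustration under explicit assumptions, not the formulation of the cited works: the offset $c_0$ is dropped, the kernel is taken to be an unnormalized Gaussian, and the smoothness term is written as $\frac{1}{2} a^\top K a$ for $h = \sum_i a_i K(\cdot, x_i)$. Minimizing $C \|g - Ka\|^2 + \frac{1}{2} a^\top K a$ then leads to $(K + \frac{1}{2C} I)\, a = g$. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(x, sigma):
    """Unnormalized Gaussian kernel K[i, j] = exp(-(x_i - x_j)^2 / (2 sigma^2)) for 1D positions x."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

def fit_quadratic_loss(x, g, sigma, C=0.5):
    """Solve (K + I / (2C)) a = g, the stationarity condition of
    C * ||g - K a||^2 + 0.5 * a^T K a (offset c_0 assumed to be zero)."""
    K = gaussian_kernel_matrix(x, sigma)
    a = np.linalg.solve(K + np.eye(len(x)) / (2.0 * C), g)
    return a, K

x = np.arange(10, dtype=float)
g = np.sin(x / 2.0)
a, K = fit_quadratic_loss(x, g, sigma=1.5)
h = K @ a  # regularized values at the sample positions
```

With $C = \frac{1}{2}$ the system reduces to $(K + I)\, a = g$, so the coefficients equal the residuals $g - h$ at the sample positions.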
2.3 Kernel Formulation of Scale Space

Before formulating the sparsified version of scale space, the regularizations
in (1) and (5) are first related to each other. That is, in the functional $\mathcal{R}$, the ‘free
parameters’ are chosen such that a solution $f$ that minimizes this functional is
equal to the minimizing solution of $\mathcal{E}$ with the $\lambda_J$ as given in Equation (3). In
this case

$$\mathcal{E}[h] = \frac{1}{2} \int \left( \big(h(x) - g(x)\big)^2 + \sum_{j=1}^{\infty} \sum_{|J|=j} \frac{t^j}{j!} \left( \frac{\partial^j h(x)}{\partial x^J} \right)^2 \right) dx . \tag{8}$$

The fact that one formulation is continuous and the other uses discrete observations
is disregarded. Doing so, it is immediately clear that in (5) the
loss function $L$ has to be the quadratic loss, i.e.,

$$L\big(g(x_i) - h(x_i)\big) = \big(g(x_i) - h(x_i)\big)^2 . \tag{9}$$

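So now our main concern is the smoothing term in both functionals.

Starting with the smoothing term from (1), with the $\lambda_J$ as defined in Equation (3),
and using induction, partial integration, and a proper
rearrangement of terms, it can be shown that the following equivalence holds (cf. [15])

$$\int \sum_{|J|=j} \left( \frac{\partial^j h(x)}{\partial x^J} \right)^2 dx = \begin{cases} \int \big( \nabla \Delta^{\frac{j-1}{2}} h(x) \big)^2 dx & \text{if } j \text{ is odd} \\ \int \big( \Delta^{\frac{j}{2}} h(x) \big)^2 dx & \text{if } j \text{ is even} \end{cases} , \tag{10}$$

where $\Delta$ is the Laplacian and $\nabla$ is the gradient operator.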

Setting $2t = \sigma^2$ in Equation (8) and substituting the results from Equation
(10), the expression obtained can be related to the result discussed in [20,19] (cf.
[5]), in which Gaussian functions are shown to be the kernel functions associated
to this specific form of regularization. That is, $K$ in (7) should be defined as

$$K(x, x_i) := e^{-\frac{\|x - x_i\|^2}{2\sigma^2}} = e^{-\frac{\|x - x_i\|^2}{4t}} , \tag{11}$$

for the regularized regression (5) to be equivalent to the Tikhonov regularization
resulting in linear scale space of $g$.

Finally, with the constant $C$ in (5) set to $\frac{1}{2}$, the regularization functionals $\mathcal{E}$
and $\mathcal{R}$ become completely equivalent.

2.4 Quadratic e-Insensitive Loss and SBMs 

Based on the foregoing equivalence, this subsection introduces the generalization
of linear scale space within the SVR framework via a quadratic $\varepsilon$-insensitive
loss: the support blob machines (SBMs). The main idea behind using this
quadratic $\varepsilon$-insensitive loss is that the generalization should possess a similar
ability to obtain sparse representations as the (linear) $\varepsilon$-insensitive loss
function exhibits.
It is formulated for pixel-based discrete images and related to SVR based on
the (linear) $\varepsilon$-insensitive loss. Subsequently, it is demonstrated how to obtain the
minimizing solution under this loss function based on a quadratic programming
(QP) formulation. This is similar to the optimization procedure used in standard
$\varepsilon$-insensitive loss support vector machines [4,22,19].

The loss function originally proposed in the context of SVR [22] is the
$\varepsilon$-insensitive loss $|\cdot|_\varepsilon$. It allows for minimization of (5) using QP and is defined
as

$$|x|_\varepsilon := \begin{cases} 0 & \text{if } |x| < \varepsilon \\ |x| - \varepsilon & \text{otherwise} \end{cases} . \tag{12}$$

This loss function bears a resemblance to some loss functions used in statistics
which provide robustness against outliers [8]. In addition to this important property,
the loss has another distinctive feature: it assigns zero cost to deviations of
$h$ from $g$ that are smaller than $\varepsilon$, and therefore every function $h$ that comes closer
than $\varepsilon$ to the $\ell$ data points $g(x_i)$ is considered to be a perfect approximation.

The similar quadratic $\varepsilon$-insensitive loss function, more closely related to the
well-known quadratic loss in Equation (9), can be defined as follows

$$|x|_\varepsilon^2 := \begin{cases} 0 & \text{if } |x| < \varepsilon \\ (|x| - \varepsilon)^2 & \text{otherwise} \end{cases} . \tag{13}$$

Using this loss, deviations from the underlying data are essentially quadratically
penalized. However, the $\varepsilon$ allows zero-cost deviations from the data points $g(x_i)$,
which, for the minimizing solution $f$, leads to several $a_i$ being zero in Equation
(6). The number of $a_i$ equal to zero depends on the parameter $\varepsilon$: the
larger $\varepsilon$ is, the more $a_i$ are zero. If $\varepsilon = 0$, then $a_i \propto g(x_i)$ (note that the Gaussian
kernel in (11) is not normalized) and so in general all $a_i$ will be nonzero. Taking $\varepsilon$
larger than zero, a sparse solution to the problem can be obtained (see [4] and
[22] for the actual underlying mechanisms leading to sparseness).
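The two loss functions can be written down in a few lines. The following is a minimal sketch of the linear and the quadratic $\varepsilon$-insensitive loss applied elementwise to residuals $g(x_i) - h(x_i)$; the function names are illustrative and do not come from the cited literature.

```python
import numpy as np

def eps_insensitive(x, eps):
    """Linear eps-insensitive loss: zero inside the eps-tube, |x| - eps outside."""
    return np.maximum(np.abs(x) - eps, 0.0)

def quadratic_eps_insensitive(x, eps):
    """Quadratic eps-insensitive loss: zero inside the eps-tube, (|x| - eps)^2 outside."""
    return np.maximum(np.abs(x) - eps, 0.0) ** 2

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
linear_cost = eps_insensitive(residuals, eps=1.0)               # [2, 0, 0, 0, 2]
quadratic_cost = quadratic_eps_insensitive(residuals, eps=1.0)  # [4, 0, 0, 0, 4]
```

For `eps=0.0` the quadratic variant reduces to the ordinary quadratic loss, which is the case corresponding to linear scale space.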

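Taking all of the foregoing into consideration, the regularization functional
for sparse scale space is now readily defined in its discrete form as

$$\mathcal{E}_\varepsilon[h] := C \sum_{i=1}^{\ell} |g(x_i) - h(x_i)|_\varepsilon^2 + \frac{1}{2} \Lambda[h] , \tag{14}$$

with, in general, $C$ equal to $\frac{1}{2}$. Taking $\varepsilon$ equal to zero results in ordinary linear
scale space.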

Fig. 1. On the right, the original image of Lenin. On the left is his right eye, the 50
by 50 sub-image used to exemplify the SBMs.

Exploiting the link with SVR, a dual QP formulation that solves
$\operatorname{argmin}_h \mathcal{E}_\varepsilon[h]$ can be stated (cf. [4,19]):

$$\operatorname*{argmin}_{\alpha^+, \alpha^-} \; \sum_{i=1}^{\ell} \Big( \varepsilon (\alpha_i^+ + \alpha_i^-) - g(x_i)(\alpha_i^+ - \alpha_i^-) \Big) + \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-) \Big( K(x_i, x_j) + \frac{1}{2C} \delta_{ij} \Big) \tag{15}$$

subject to $\sum_{i=1}^{\ell} (\alpha_i^+ - \alpha_i^-) = 0$ and $\alpha_i^+, \alpha_i^- \geq 0$. The $\alpha_i^{-*}$ and $\alpha_i^{+*}$
that optimize (15) then determine the optimal solution to the minimization of the
functional $\mathcal{E}_\varepsilon$, i.e., (cf. Equation (6))

$$f(x) = \sum_{i=1}^{\ell} (\alpha_i^{+*} - \alpha_i^{-*}) K(x, x_i) + c_0^* . \tag{16}$$

The optimal offset $c_0^*$ can be determined by exploiting the Karush-Kuhn-Tucker
condition after the solution to the foregoing problem has been obtained. The
condition basically states that at the optimal solution of QP (15), the product
of the dual variables and their constraints should vanish (see [19,22]).
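To give a feeling for how sparseness arises, the dual can be explored with a small numerical sketch. It is not the QP solver referred to above and it makes several simplifying assumptions, all of them ours: the offset is fixed to zero, so the equality constraint on the dual variables is dropped; the problem is rewritten in $\beta_i = \alpha_i^+ - \alpha_i^-$, using that $\alpha_i^+ \alpha_i^- = 0$ at the optimum, which turns the objective into $\varepsilon \|\beta\|_1 - g^\top \beta + \frac{1}{2} \beta^\top (K + \frac{1}{2C} I) \beta$; and this objective is minimized by proximal gradient descent with soft-thresholding. The names `soft_threshold` and `sbm_coefficients` are hypothetical.

```python
import numpy as np

def soft_threshold(z, thr):
    """Proximal operator of thr * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def sbm_coefficients(K, g, eps, C=0.5, n_iter=2000):
    """Minimize eps*||b||_1 - g^T b + 0.5 * b^T (K + I/(2C)) b by proximal gradient.
    Assumption: zero offset, so no equality constraint on the coefficients."""
    M = K + np.eye(len(g)) / (2.0 * C)
    step = 1.0 / np.linalg.eigvalsh(M)[-1]  # 1 / largest eigenvalue of M
    beta = np.zeros_like(g, dtype=float)
    for _ in range(n_iter):
        beta = soft_threshold(beta - step * (M @ beta - g), step * eps)
    return beta

x = np.arange(20, dtype=float)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * 1.5 ** 2))
g = 100.0 + 50.0 * np.sin(x / 3.0)
beta_dense = sbm_coefficients(K, g, eps=0.0)    # plain regularized solution
beta_sparse = sbm_coefficients(K, g, eps=30.0)  # coefficients may become exactly zero
fraction = np.count_nonzero(beta_sparse) / len(g)  # relative number of support vectors
```

In this sketch `fraction` plays the role of the relative number of support vectors; once `eps` is at least as large as every sample value, all coefficients vanish.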



3 An Illustrative Experiment 

The SBMs are exemplified on a single, small gray value image. The image is
a 50 by 50 sub-image taken from a larger image of Lenin (see Figure 1). The
sub-image is Lenin’s right eye. The gray values of this image are scaled between
0 and 255 for this experiment. Note that it is important to know the range the
gray values are in, because the function minimizing $\mathcal{E}_\varepsilon$ with $\varepsilon > 0$ is not invariant
under (linear) intensity scalings, which is due to the $\varepsilon$-insensitivity.

For this image, for several settings of the parameters $\sigma$ ($= \sqrt{2t}$) and $\varepsilon$, the
support vectors are determined using the functional (14). Simultaneously, the optimal
values for the parameters $\alpha_i^+$ and $\alpha_i^-$ are obtained. These values, together
with the Gaussian kernels, define the regularized image via Equation (16).

Figure 2 plots the values $\alpha_i^{+*} - \alpha_i^{-*}$ in the position of the blob it supports in
the SBM for varying $\sigma$ and $\varepsilon$. Figure 3 gives the regularized images, which are
actually blurred versions (as is clear from Equation (16)) of the images in Figure
2. Put differently, the images in Figure 2 can be considered de-blurred images.
Fig. 2. Plots of the values $\alpha_i^{+*} - \alpha_i^{-*}$ given by the SBMs. On the top of every image
the scale, the value of $\varepsilon$, and the relative number of support vectors are given.
[Panel titles (scale | $\varepsilon$ | relative number of support vectors):
0.4 | 169 | 0.322; 1 | 169 | 0.2748; 2.5 | 1 | 0.8616; 2.5 | 16 | 0.4572; 2.5 | 49 | 0.2856.]



Because it is not immediately clear from Figure 2 when there is actually a
support vector present in a certain position, i.e., when $\alpha_i^{+*} - \alpha_i^{-*}$ is not equal
to zero, Figure 4 indicates in black the positions that contain a support vector.

The text at the top of every image in Figures 2 to 4 gives
information on the scale $\sigma$, the $\varepsilon$, and the relative number of support vectors (as
$\cdot \mid \cdot \mid \cdot$). This last number is simply calculated by dividing the number of support
vectors by $2500 = 50^2$.



4 Discussion and Conclusion 

We introduced support blob machines (SBMs) based on a link between scale 
space theory and support vector regression, which are connected to each other 
via regularization theory. The SBMs give a sparsification of linear scale space by
employing a quadratic $\varepsilon$-insensitive loss function in its regularization functional.
Through the sparseness obtained, the regularized function can be represented
using only a small collection of building blocks or basis functions.

Fig. 3. The regularized images associated to a certain scale $\sigma$ and value of $\varepsilon$. (The
values are given on the top of every image. The relative number of support vectors is
given here also as the last number.)



Some SBM instances were exemplified on a small 50 by 50 image, and it
was shown that the technique indeed obtains sparse representations of images
at a certain scale. However, in our tests, the reduction of information that was
attained is certainly not overwhelming, and further research should be conducted
before a definite conclusion about the performance of SBMs can be stated. A
simple suggestion, which could lead to improved sparseness performance, is to
increase the parameter $C$ in the functional. In the tests this was set to $\frac{1}{2}$ to keep
a close link with standard scale space. A larger value for $C$ leads automatically
to a sparser representation of the underlying signal.

The principal contribution of this article is, however, the formal link between 
two interesting techniques: scale space and support vector machines. This link 
could now be further exploited and more advanced regularization approaches 
may be considered. 

Our future research will focus on developing a formulation that gives a sparse 
representation while taking all scales into account simultaneously and not merely 
one scale at a time. This may, in combination with different types of loss func- 
tions, lead to a robust form of scale selection in combination with blob detection 





Support Blob Machines 






Fig. 4. In black are the positions that contain a support vector of the SBMs. The last
number on top of every image gives the relative area of the image that is black, i.e., it
gives the relative number of support vectors.
[Panel titles (scale | $\varepsilon$ | relative number of support vectors):
0.4 | 1 | 0.9064; 0.4 | 16 | 0.8164; 0.4 | 49 | 0.6316; 0.4 | 100 | 0.4748; 0.4 | 169 | 0.322;
1 | 100 | 0.424; 1 | 169 | 0.2748;
2.5 | 16 | 0.4572; 2.5 | 49 | 0.2856; 2.5 | 100 | 0.1952; 2.5 | 169 | 0.0992;
6.25 | 1 | 0.7776; 6.25 | 16 | 0.4016.]



[13]. In addition to this, representations of higher order features may be incorporated,
i.e., not only blobs, which makes a closer connection to the work in,
for example, [12] (see also Section 1), in which several receptive field weighting
functions are to be chosen in such a manner as to represent the image in an optimal
way. In this, anisotropic forms of SBMs may also be of interest.
Abstract. We study an energy functional for computing optical flow that combines
three assumptions: a brightness constancy assumption, a gradient constancy
assumption, and a discontinuity-preserving spatio-temporal smoothness
constraint. In order to allow for large displacements, linearisations in the two data
terms are strictly avoided. We present a consistent numerical scheme based on two
nested fixed point iterations. By proving that this scheme implements a coarse-to-fine
warping strategy, we give a theoretical foundation for warping which has been
used on a mainly experimental basis so far. Our evaluation demonstrates that the
novel method gives significantly smaller angular errors than previous techniques
for optical flow estimation. We show that it is fairly insensitive to parameter variations,
and we demonstrate its excellent robustness under noise.



1 Introduction 

Optical flow estimation is still one of the key problems in computer vision. Estimating the 
displacement field between two images, it is applied as soon as correspondences between 
pixels are needed. Problems of this type are not only restricted to motion estimation, 
they are also present in a similar fashion in 3D reconstruction or image registration. 
In the last two decades the quality of optical flow estimation methods has increased 
dramatically. Starting from the original approaches of Horn and Schunck [11] as well 
as Lucas and Kanade [15], research developed many new concepts for dealing with 
shortcomings of previous models. In order to handle discontinuities in the flow field, 
the quadratic regulariser in the Horn and Schunck model was replaced by smoothness 
constraints that permit piecewise smooth results [1,9,19,21,25]. Some of these ideas are
close in spirit to methods for joint motion estimation and motion segmentation [10,17],
and to optical flow methods motivated from robust statistics where outliers are penalised
less severely [6,7]. Coarse-to-fine strategies [3,7,16] as well as non-linearised models
[19,2] have been used to tackle large displacements. Finally, spatio-temporal approaches 
have ameliorated the results simply by using the information of an additional dimension 
[18,6,26,10]. 

However, not only new ideas have improved the quality of optical flow estimation
techniques. Efforts to obtain a better understanding of what the methods do in detail,
and which effects are caused by changing their parameters, also gave an insight into how
several models could work together. Furthermore, variational formulations of models
gave access to the long experience of numerical mathematics in solving partly difficult
optimisation problems. Finding the optimal solution to a certain model is often not trivial,
and often the full potential of a model is not used because concessions to implementation
aspects have to be made.

* We gratefully acknowledge partial funding by the Deutsche Forschungsgemeinschaft (DFG).

In this paper we propose a novel variational approach that integrates several of the
aforementioned concepts and which can be minimised with a solid numerical method.
It is further shown that a coarse-to-fine strategy using the so-called warping technique
[7,16] implements the non-linearised optical flow constraint used in [19,2] and in image
registration. This has two important effects: Firstly, it becomes possible to integrate the 
warping technique, which was so far only algorithmically motivated, into a variational 
framework. Secondly, it shows a theoretically sound way of how image correspondence 
problems can be solved with an efficient multi-resolution technique. It should be noted 
that - apart from a very nice paper by Lefebure and Cohen [14] - not many theoretical 
results on warping are available so far. 

Finally, the grey value constancy assumption, which is the basic assumption in optical 
flow estimation, is extended by a gradient constancy assumption. This makes the method 
robust against grey value changes. While gradient constancy assumptions have also been 
proposed in [23,22] in order to deal with the aperture problem in the scope of a local 
approach, their use within variational methods is novel. 

The experimental evaluation shows that our method yields excellent results. Com- 
pared to those in the literature, their accuracy is always significantly higher, sometimes 
even twice as high as the best value known so far. Moreover, the method proved also 
to be robust under a considerable amount of noise and computation times of only a few 
seconds per frame on contemporary hardware are possible. 

Paper Organisation. In the next section, our variational model is introduced, first by 
discussing all model assumptions, and then in form of an energy based formulation. 
Section 3 derives a minimisation scheme for this energy. The theoretical foundation of 
warping methods as a numerical approximation step is given in Section 4. An experi- 
mental evaluation is presented in Section 5, followed by a brief summary in Section 6. 



2 The Variational Model 

Before deriving a variational formulation for our optical flow method, we give an intuitive 
idea of which constraints in our view should be included in such a model. 

- Grey Value Constancy Assumption 

Since the beginning of optical flow estimation, it has been assumed that the grey 
value of a pixel is not changed by the displacement. 

$$I(x, y, t) = I(x + u, y + v, t + 1) \tag{1}$$

Here $I : \Omega \subset \mathbb{R}^3 \rightarrow \mathbb{R}$ denotes a rectangular image sequence, and $\mathbf{w} := (u, v, 1)^\top$
is the searched displacement vector between an image at time $t$ and another image
at time $t + 1$. The linearised version of the grey value constancy assumption yields
the famous optical flow constraint [11]

$$I_x u + I_y v + I_t = 0 \tag{2}$$

where subscripts denote partial derivatives. However, this linearisation is only valid 
under the assumption that the image changes linearly along the displacement, which 
is in general not the case, especially for large displacements. Therefore, our model 
will use the original, non-linearised grey value constancy assumption (1). 

- Gradient Constancy Assumption 

The grey value constancy assumption has one decisive drawback: It is quite suscep- 
tible to slight changes in brightness, which often appear in natural scenes. Therefore, 
it is useful to allow some small variations in the grey value and help to determine 
the displacement vector by a criterion that is invariant under grey value changes. 
Such a criterion is the gradient of the image grey value, which can also be assumed 
not to vary due to the displacement [23]. This gives 

$$\nabla I(x, y, t) = \nabla I(x + u, y + v, t + 1). \tag{3}$$

Here $\nabla = (\partial_x, \partial_y)^\top$ denotes the spatial gradient. Again it can be useful to refrain
from a linearisation. The constraint (3) is particularly helpful for translatory motion, 
while constraint (2) can be better suited for more complicated motion patterns. 

- Smoothness Assumption 

So far, the model estimates the displacement of a pixel only locally without taking 
any interaction between neighbouring pixels into account. Therefore, it runs into 
problems as soon as the gradient vanishes somewhere, or if only the flow in normal 
direction to the gradient can be estimated (aperture problem). Furthermore, one 
would expect some outliers in the estimates. Hence, it is useful to introduce as a 
further assumption the smoothness of the flow field. This smoothness constraint can 
either be applied solely to the spatial domain, if there are only two frames available, 
or to the spatio-temporal domain, if the displacements in a sequence of images are 
wanted. As the optimal displacement field will have discontinuities at the boundaries 
of objects in the scene, it is sensible to generalise the smoothness assumption by 
demanding a piecewise smooth flow field. 

- Multiscale Approach 

In the case of displacements that are larger than one pixel per frame, the cost func- 
tional in a variational formulation must be expected to be multi-modal, i.e. a min- 
imisation algorithm could easily be trapped in a local minimum. In order to find 
the global minimum, it can be useful to apply multiscale ideas: One starts with 
solving a coarse, smoothed version of the problem by working on the smoothed 
image sequence. The new problem may have a unique minimum, hopefully close 
to the global minimum of the original problem. The coarse solution is used as ini- 
tialisation for solving a refined version of the problem until step by step the original 
problem is solved. Instead of smoothing the image sequence, it is more efficient to 
downsample the images respecting the sampling theorem, so the model ends up in 
a multiresolution strategy. 

With this description, it is straightforward to derive an energy functional that penalises
deviations from these model assumptions. Let $\mathbf{x} := (x, y, t)^\top$ and $\mathbf{w} :=
(u, v, 1)^\top$. Then the global deviations from the grey value constancy assumption and the
gradient constancy assumption are measured by the energy

$$E_{Data}(u, v) = \int_\Omega \left( |I(\mathbf{x} + \mathbf{w}) - I(\mathbf{x})|^2 + \gamma |\nabla I(\mathbf{x} + \mathbf{w}) - \nabla I(\mathbf{x})|^2 \right) d\mathbf{x} \tag{4}$$

with $\gamma$ being a weight between both assumptions. Since with quadratic penalisers outliers
get too much influence on the estimation, an increasing concave function $\Psi(s^2)$ is
applied, leading to a robust energy [7,16]:

$$E_{Data}(u, v) = \int_\Omega \Psi\left( |I(\mathbf{x} + \mathbf{w}) - I(\mathbf{x})|^2 + \gamma |\nabla I(\mathbf{x} + \mathbf{w}) - \nabla I(\mathbf{x})|^2 \right) d\mathbf{x} \tag{5}$$

The function $\Psi$ can also be applied separately to each of these two terms. We use the
function $\Psi(s^2) = \sqrt{s^2 + \epsilon^2}$, which results in (modified) $L^1$ minimisation. Due to the
small positive constant $\epsilon$, $\Psi(s)$ is still convex, which offers advantages in the minimisation
process. Moreover, this choice of $\Psi$ does not introduce any additional parameters, since
$\epsilon$ is only for numerical reasons and can be set to a fixed value, which we choose to be
0.001.
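The robust function and its derivative with respect to $s^2$, which reappears later as a robustness factor and as a diffusivity, can be sketched as follows. This is only an illustration: it assumes the form $\Psi(s^2) = \sqrt{s^2 + \epsilon^2}$ with $\epsilon = 0.001$, and the names `psi` and `psi_prime` are ours.

```python
import numpy as np

EPSILON = 0.001  # fixed small constant, only needed for numerical reasons

def psi(s2, eps=EPSILON):
    """Robust penaliser Psi(s^2) = sqrt(s^2 + eps^2), a regularised absolute value."""
    return np.sqrt(s2 + eps ** 2)

def psi_prime(s2, eps=EPSILON):
    """Derivative of Psi with respect to its argument s^2: 1 / (2 * sqrt(s^2 + eps^2))."""
    return 0.5 / np.sqrt(s2 + eps ** 2)

s2 = np.array([0.0, 1.0, 4.0, 9.0])
penalties = psi(s2)      # close to |s| away from zero
weights = psi_prime(s2)  # large residuals receive small weights
```

Because `psi_prime` decreases with growing `s2`, large deviations are down-weighted, which is what limits the influence of outliers.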

Finally, a smoothness term has to describe the model assumption of a piecewise smooth 
flow field. This is achieved by penalising the total variation of the flow field [20,8], which 
can be expressed as 

$$E_{Smooth}(u, v) = \int_\Omega \Psi\left( |\nabla_3 u|^2 + |\nabla_3 v|^2 \right) d\mathbf{x} \tag{6}$$

with the same function for $\Psi$ as above. The spatio-temporal gradient $\nabla_3 := (\partial_x, \partial_y, \partial_t)^\top$
indicates that a spatio-temporal smoothness assumption is involved. For applications
with only two images available it is replaced by the spatial gradient.

The total energy is the weighted sum between the data term and the smoothness term

$$E(u, v) = E_{Data} + \alpha E_{Smooth} \tag{7}$$

with some regularisation parameter $\alpha > 0$. Now the goal is to find the functions $u$ and
$v$ that minimise this energy.

3 Minimisation

3.1 Euler-Lagrange Equations 

Since E{u, v) is highly nonlinear, the minimisation is not trivial. For better readability 
we define the following abbreviations, where the use of z instead of t emphasises that the 
expression is not a temporal derivative but a difference that is sought to be minimised. 

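$$\begin{aligned}
I_x &:= \partial_x I(\mathbf{x} + \mathbf{w}), \\
I_y &:= \partial_y I(\mathbf{x} + \mathbf{w}), \\
I_z &:= I(\mathbf{x} + \mathbf{w}) - I(\mathbf{x}), \\
I_{xx} &:= \partial_{xx} I(\mathbf{x} + \mathbf{w}), \\
I_{xy} &:= \partial_{xy} I(\mathbf{x} + \mathbf{w}), \\
I_{yy} &:= \partial_{yy} I(\mathbf{x} + \mathbf{w}), \\
I_{xz} &:= \partial_x I(\mathbf{x} + \mathbf{w}) - \partial_x I(\mathbf{x}), \\
I_{yz} &:= \partial_y I(\mathbf{x} + \mathbf{w}) - \partial_y I(\mathbf{x}).
\end{aligned} \tag{8}$$

According to the calculus of variations, a minimiser of (7) must fulfill the Euler-Lagrange
equations

$$\begin{aligned}
\Psi'\big(I_z^2 + \gamma (I_{xz}^2 + I_{yz}^2)\big) \cdot \big(I_x I_z + \gamma (I_{xx} I_{xz} + I_{xy} I_{yz})\big) - \alpha \operatorname{div}\big( \Psi'(|\nabla_3 u|^2 + |\nabla_3 v|^2) \nabla_3 u \big) &= 0, \\
\Psi'\big(I_z^2 + \gamma (I_{xz}^2 + I_{yz}^2)\big) \cdot \big(I_y I_z + \gamma (I_{yy} I_{yz} + I_{xy} I_{xz})\big) - \alpha \operatorname{div}\big( \Psi'(|\nabla_3 u|^2 + |\nabla_3 v|^2) \nabla_3 v \big) &= 0
\end{aligned}$$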

with reflecting boundary conditions. 

3.2 Numerical Approximation 

The preceding Euler-Lagrange equations are nonlinear in their argument $\mathbf{w} = (u, v, 1)^\top$.
A first step towards a linear system of equations, which can be solved with common
numerical methods, is the use of fixed point iterations on $\mathbf{w}$. In order to implement a
multiscale approach, necessary to better approximate the global optimum of the energy,
these fixed point iterations are combined with a downsampling strategy. Instead of the
standard downsampling factor of 0.5 on each level, it is proposed here to use an arbitrary
factor $\eta \in (0, 1)$, which allows smoother transitions from one scale to the next.¹ Moreover,
the full pyramid of images is used, starting with the smallest possible image at the coarsest
grid. Let $\mathbf{w}^k = (u^k, v^k, 1)^\top$, $k = 0, 1, \ldots$, with the initialisation $\mathbf{w}^0 = (0, 0, 1)^\top$ at
the coarsest grid. Further, let $I_*^k$ be the abbreviations defined in (8) but with the iteration
variable $\mathbf{w}^k$ instead of $\mathbf{w}$. Then $\mathbf{w}^{k+1}$ will be the solution of

$$\begin{aligned}
\Psi'\big((I_z^{k+1})^2 + \gamma ((I_{xz}^{k+1})^2 + (I_{yz}^{k+1})^2)\big) \cdot \big(I_x^k I_z^{k+1} + \gamma (I_{xx}^k I_{xz}^{k+1} + I_{xy}^k I_{yz}^{k+1})\big) \qquad & \\
- \alpha \operatorname{div}\big( \Psi'(|\nabla_3 u^{k+1}|^2 + |\nabla_3 v^{k+1}|^2) \nabla_3 u^{k+1} \big) &= 0, \\
\Psi'\big((I_z^{k+1})^2 + \gamma ((I_{xz}^{k+1})^2 + (I_{yz}^{k+1})^2)\big) \cdot \big(I_y^k I_z^{k+1} + \gamma (I_{yy}^k I_{yz}^{k+1} + I_{xy}^k I_{xz}^{k+1})\big) \qquad & \\
- \alpha \operatorname{div}\big( \Psi'(|\nabla_3 u^{k+1}|^2 + |\nabla_3 v^{k+1}|^2) \nabla_3 v^{k+1} \big) &= 0.
\end{aligned} \tag{9}$$

As soon as a fixed point in $\mathbf{w}^k$ is reached, we change to the next finer scale and use this
solution as initialisation for the fixed point iteration on this scale.
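How a downsampling factor $\eta \in (0, 1)$ translates into the sequence of grid sizes can be sketched as follows. The function `pyramid_shapes` and its `min_size` stopping rule are assumptions made for this illustration; the text only states that the pyramid starts with the smallest possible image.

```python
def pyramid_shapes(height, width, eta, min_size=8):
    """Image sizes from coarsest to finest when each level shrinks both axes by eta.
    Assumption: levels stop once the shorter side would drop below min_size pixels."""
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie strictly between 0 and 1")
    shapes = []
    level = 0
    while True:
        h = int(round(height * eta ** level))
        w = int(round(width * eta ** level))
        if level > 0 and min(h, w) < min_size:
            break
        shapes.append((h, w))
        level += 1
    return shapes[::-1]  # coarsest level first, original resolution last

coarse_to_fine = pyramid_shapes(64, 64, eta=0.5)   # four levels: 8, 16, 32, 64
smooth_pyramid = pyramid_shapes(64, 64, eta=0.95)  # many more, closely spaced levels
```

A factor close to 1 yields many closely spaced levels, which is the smoother transition between scales mentioned above; the outer fixed point iteration would visit these levels in the returned order.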

Notice that we have a fully implicit scheme for the smoothness term and a semi-implicit
scheme for the data term. Implicit schemes are used to yield higher stability and faster
convergence. However, this new system is still nonlinear because of the nonlinear function
$\Psi'$ and the symbols $I_*^{k+1}$. In order to remove the nonlinearity in $I_*^{k+1}$, first order
Taylor expansions are used:

$$\begin{aligned}
I_z^{k+1} &\approx I_z^k + I_x^k du^k + I_y^k dv^k, \\
I_{xz}^{k+1} &\approx I_{xz}^k + I_{xx}^k du^k + I_{xy}^k dv^k, \\
I_{yz}^{k+1} &\approx I_{yz}^k + I_{xy}^k du^k + I_{yy}^k dv^k,
\end{aligned}$$

where $u^{k+1} = u^k + du^k$ and $v^{k+1} = v^k + dv^k$. So we split the unknowns $u^{k+1}$, $v^{k+1}$
into the solutions of the previous iteration step $u^k$, $v^k$ and unknown increments $du^k$, $dv^k$.

¹ Since the grid size in both $x$- and $y$-direction is reduced by $\eta$, the image size in fact shrinks
with a factor $\eta^2$ at each scale.
For better readability let

$$\begin{aligned}
(\Psi')_{Data}^k &:= \Psi'\Big( (I_z^k + I_x^k du^k + I_y^k dv^k)^2 \\
&\qquad + \gamma \big( (I_{xz}^k + I_{xx}^k du^k + I_{xy}^k dv^k)^2 + (I_{yz}^k + I_{xy}^k du^k + I_{yy}^k dv^k)^2 \big) \Big), \\
(\Psi')_{Smooth}^k &:= \Psi'\big( |\nabla_3 (u^k + du^k)|^2 + |\nabla_3 (v^k + dv^k)|^2 \big),
\end{aligned} \tag{10}$$

where $(\Psi')_{Data}^k$ can be interpreted as a robustness factor in the data term, and $(\Psi')_{Smooth}^k$
as a diffusivity in the smoothness term. With this the first equation in system (9) can be
written as

$$\begin{aligned}
0 = {} & (\Psi')_{Data}^k \cdot \big( I_x^k (I_z^k + I_x^k du^k + I_y^k dv^k) \big) \\
& + \gamma (\Psi')_{Data}^k \cdot \big( I_{xx}^k (I_{xz}^k + I_{xx}^k du^k + I_{xy}^k dv^k) + I_{xy}^k (I_{yz}^k + I_{xy}^k du^k + I_{yy}^k dv^k) \big) \\
& - \alpha \operatorname{div}\big( (\Psi')_{Smooth}^k \nabla_3 (u^k + du^k) \big) ,
\end{aligned} \tag{11}$$

and the second equation can be expressed in a similar way. This is still a nonlinear system
of equations for a fixed $k$, but now in the unknown increments $du^k$, $dv^k$. As the only
remaining nonlinearity is due to $\Psi'$, and $\Psi$ has been chosen to be a convex function, the
remaining optimisation problem is a convex problem, i.e. there exists a unique minimum
solution.

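In order to remove the remaining nonlinearity in $\Psi'$, a second, inner, fixed point iteration
loop is applied. Let $du^{k,0} := 0$, $dv^{k,0} := 0$ be our initialisation and let $du^{k,l}$, $dv^{k,l}$
denote the iteration variables at some step $l$. Furthermore, let $(\Psi')_{Data}^{k,l}$ and $(\Psi')_{Smooth}^{k,l}$
denote the robustness factor and the diffusivity defined in (10) at iteration $k, l$. Then
finally the linear system of equations in $du^{k,l+1}$, $dv^{k,l+1}$ reads

$$\begin{aligned}
0 = {} & (\Psi')_{Data}^{k,l} \cdot \big( I_x^k (I_z^k + I_x^k du^{k,l+1} + I_y^k dv^{k,l+1}) \big) \\
& + \gamma (\Psi')_{Data}^{k,l} \cdot \big( I_{xx}^k (I_{xz}^k + I_{xx}^k du^{k,l+1} + I_{xy}^k dv^{k,l+1}) + I_{xy}^k (I_{yz}^k + I_{xy}^k du^{k,l+1} + I_{yy}^k dv^{k,l+1}) \big) \\
& - \alpha \operatorname{div}\big( (\Psi')_{Smooth}^{k,l} \nabla_3 (u^k + du^{k,l+1}) \big)
\end{aligned} \tag{12}$$

for the first equation. Using standard discretisations for the derivatives, the resulting
sparse linear system of equations can now be solved with common numerical methods,
such as Gauss-Seidel or SOR iterations. Expressions of type $I(\mathbf{x} + \mathbf{w}^k)$ are computed
by means of bilinear interpolation.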
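The evaluation of $I(\mathbf{x} + \mathbf{w}^k)$ for a single frame pair can be sketched with plain NumPy. The function `warp_bilinear` is a hypothetical helper, and clamping the sampling positions to the image border is our own assumption; the text does not specify how positions outside the image are treated.

```python
import numpy as np

def warp_bilinear(image, u, v):
    """Sample a 2D image at (x + u, y + v) using bilinear interpolation.
    Assumption: positions outside the image are clamped to the border."""
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    xs = np.clip(xx + u, 0.0, w - 1.0)
    ys = np.clip(yy + v, 0.0, h - 1.0)
    # Index of the upper-left neighbour, kept at least one pixel away from the far border.
    x0 = np.minimum(np.floor(xs).astype(int), w - 2)
    y0 = np.minimum(np.floor(ys).astype(int), h - 2)
    ax = xs - x0  # horizontal interpolation weight in [0, 1]
    ay = ys - y0  # vertical interpolation weight in [0, 1]
    top = (1.0 - ax) * image[y0, x0] + ax * image[y0, x0 + 1]
    bottom = (1.0 - ax) * image[y0 + 1, x0] + ax * image[y0 + 1, x0 + 1]
    return (1.0 - ay) * top + ay * bottom

# Linear ramp image[y, x] = 3*y + 2*x, which bilinear interpolation reproduces exactly.
image = np.add.outer(3.0 * np.arange(5), 2.0 * np.arange(6))
zero = np.zeros_like(image)
warped = warp_bilinear(image, zero + 0.5, zero + 0.25)
```

The difference between the warped second frame and the first frame corresponds to the quantity $I_z^k$ used in the equations above.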



4 Relation to Warping Methods 

Coarse-to-fine warping techniques are a frequently used tool for improving the per- 
formance of optic flow methods [3,7,17]. While they are often introduced on a purely 
experimental basis, we show in this section that they can be theoretically justified as a 
numerical approximation. 
In order to establish this relation, we restrict ourselves to the grey value constancy
model by setting $\gamma = 0$. Let us also simplify the model by assuming solely spatial
smoothness, as in [17]. Under these conditions, (11) can be written as

$$(\Psi')_{Data}^k \nabla I^k (\nabla I^k)^\top \begin{pmatrix} du^k \\ dv^k \end{pmatrix} + (\Psi')_{Data}^k I_z^k \nabla I^k - \alpha \begin{pmatrix} \operatorname{div}\big((\Psi')_{Smooth}^k \nabla (u^k + du^k)\big) \\ \operatorname{div}\big((\Psi')_{Smooth}^k \nabla (v^k + dv^k)\big) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \tag{13}$$

For a fixed $k$, this system is equivalent to the Euler-Lagrange equations described in [17].
Also there, only the increments $du$ and $dv$ between the first image and the warped second
image are estimated. The same increments appear in the outer fixed point iterations of
our approach in order to resolve the nonlinearity of the grey value constancy assumption.
This shows that the warping technique implements the minimisation of a non-linearised
constancy assumption by means of fixed point iterations on $\mathbf{w}$.

In earlier approaches, the main motivation for warping has been the coarse-to-fine
strategy. Due to solutions $u$ and $v$ computed on coarser grids, only an increment $du$ and
$dv$ had to be computed on the fine grid. Thus, the estimates used to have a magnitude of
less than one pixel per frame, independent of the magnitude of the total displacement.
This ability to deal with larger displacements proved to be a very important aspect in
differential optical flow estimation.

A second strategy to deal with large displacements has been the usage of the non-linearised
grey value constancy assumption [19,2]. Here, large displacements are allowed
from the beginning. However, the nonlinearity results in a multi-modal functional. In
such a setting, the coarse-to-fine strategy is not only wanted, but even necessary to better
approximate the global minimum. In the end, both strategies not only lead to similar
results. In fact, as we have seen above, they are completely equivalent. As a consequence,
the coarse-to-fine warping technique can be formulated as a single minimisation problem,
and image registration techniques relying on non-linearised constancy assumptions get
access to an efficient multiresolution method for minimising their energy functionals.



5 Evaluation 

For evaluation purposes experiments with both synthetic and real-world image data were 
performed. The presented angular errors were computed according to [5]. 
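
To make the error measure concrete, here is a minimal sketch of an angular error computation. It assumes that [5] refers to the widely used space-time measure, in which each flow vector (u, v) is extended to (u, v, 1) and the angle between the estimated and the true vector is reported; the function names and the list-of-pairs data layout are illustrative choices.

```python
import math


def angular_error_deg(u_est, v_est, u_true, v_true):
    """Angle in degrees between the space-time vectors (u, v, 1)."""
    numerator = u_est * u_true + v_est * v_true + 1.0
    denominator = math.sqrt((u_est ** 2 + v_est ** 2 + 1.0)
                            * (u_true ** 2 + v_true ** 2 + 1.0))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, numerator / denominator))
    return math.degrees(math.acos(cosine))


def angular_error_stats(estimated, truth):
    """Return (AAE, STD) over paired lists of (u, v) flow vectors."""
    errors = [angular_error_deg(ue, ve, ut, vt)
              for (ue, ve), (ut, vt) in zip(estimated, truth)]
    mean = sum(errors) / len(errors)
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))
    return mean, std
```

Under this assumption, a zero estimate against a true flow of one pixel per frame gives an error of 45 degrees, which shows that the measure penalises relative rather than absolute deviations.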

Let us start our evaluation with the two variants of a famous sequence: the Yosemite 
sequence with and without cloudy sky. The original version with cloudy sky was created 
by Lynn Quam and is available at ftp://ftp.csd.uwo.ca/pub/vision. It combines 
both divergent and translational motion. The version without clouds is available 
at http://www.cs.brown.edu/people/black/images.html. 

Tab. 1 shows a comparison of our results for both sequences to the best results from 
the literature. As one can see, our variational approach outperforms all other methods. 
Regarding the sequence with clouds, we achieve results that are more than twice as 
accurate as all results from the literature. For the sequence without clouds, angular errors 
below 1 degree are reached for the first time with a method that offers full density. The 
corresponding flow fields presented in Fig. 1 give a qualitative impression of these raw 
numbers: They match the ground truth very well. Not only the discontinuity between the 
Table 1. Comparison between the results from the literature with 100 % density and our results for 
the Yosemite sequence with and without cloudy sky. AAE = average angular error. STD = standard 
deviation. 2D = spatial smoothness assumption. 3D = spatio-temporal smoothness assumption. 

Yosemite with clouds                          Yosemite without clouds 

Technique                 AAE      STD        Technique                  AAE     STD 
Nagel [5]                 10.22°   16.51°     Ju et al. [12]             2.16°   2.00° 
Horn-Schunck, mod. [5]     9.78°   16.19°     Bab-Hadiashar-Suter [4]    2.05°   2.92° 
Uras et al. [5]            8.94°   15.61°     Lai-Vemuri [13]            1.99°   1.41° 
Alvarez et al. [2]         5.53°    7.40°     Our method (2D)            1.59°   1.39° 
Weickert et al. [24]       5.18°    8.68°     Mémin-Pérez [16]           1.58°   1.21° 
Mémin-Pérez [16]           4.69°    6.89°     Weickert et al. [24]       1.46°   1.50° 
Our method (2D)            2.46°    7.31°     Farnebäck [10]             1.14°   2.14° 
Our method (3D)            1.94°    6.02°     Our method (3D)            0.98°   1.17° 

Table 2. Results for the Yosemite sequence with and without cloudy sky. Gaussian noise with 
varying standard deviations $\sigma_n$ was added, and the average angular errors and their standard 
deviations were computed. AAE = average angular error. STD = standard deviation. 

Yosemite with clouds              Yosemite without clouds 

$\sigma_n$   AAE     STD          $\sigma_n$   AAE     STD 
0            1.94°   6.02°        0            0.98°   1.17° 
10           2.50°   5.96°        10           1.26°   1.29° 
20           3.12°   6.24°        20           1.63°   1.39° 
30           3.77°   6.54°        30           2.03°   1.53° 
40           4.37°   7.12°        40           2.40°   1.71° 

two types of motion is preserved, also the translational motion of the clouds is estimated 
accurately. The reason for this behaviour lies in our assumptions, that are clearly stated 
in the energy functional: While the choice of the smoothness term allows discontinuities, 
the gradient constancy assumption is able to handle brightness changes - like in the area 
of the clouds. 

Because of the presence of second order image derivatives in the Euler-Lagrange 
equations, we tested the influence of noise on the performance of our method in the next 
experiment. We added Gaussian noise of mean zero and different standard deviations 
to both sequences. The obtained results are presented in Tab. 2. They show that our 
approach yields excellent flow estimates even when severe noise is present: For the 
cloudy Yosemite sequence, our average angular error for noise with standard deviation 
40 is better than all results from the literature for the sequence without noise. 

In a third experiment we evaluated the robustness with respect to the free parameters in our 
approach: the weight $\gamma$ between the grey value and the gradient constancy assumption, 
and the smoothness parameter $\alpha$. Often an image sequence is preprocessed by Gaussian 
convolution with standard deviation $\sigma$ [5]. In this case, $\sigma$ can be regarded as a third 
parameter. We computed results with parameter settings that deviated by a factor 2 in 
both directions from the optimum setting. The outcome listed in Tab. 3 shows that the 
method is also very robust under parameter variations. 

Although our paper does not focus on fast computation but on high accuracy, the 
implicit minimisation scheme presented here is also reasonably fast, especially if the 
Fig. 1. (a) Top left: Frame 8 of the Yosemite sequence without clouds. (b) Top right: Corresponding 
frame of the sequence with clouds. (c) Middle left: Ground truth without clouds. (d) Middle right: 
Ground truth with clouds. (e) Bottom left: Computed flow field by our 3D method for the sequence 
without clouds. (f) Bottom right: Ditto for the sequence with clouds. 



reduction factor $\eta$ is lowered or if the iterations are stopped before full convergence. The 
convergence behaviour and computation times can be found in Tab. 4. Computations 
have been performed on a 3.06 GHz Intel Pentium 4 processor executing C/C++ code. 

For evaluating the performance of our method for real-world image data, the Ettlinger 
Tor traffic sequence by Nagel was used. This sequence consists of 50 frames of size 
Table 3. Parameter variation for our method with spatio-temporal smoothness assumption. 

Yosemite with clouds 

$\sigma$   $\alpha$   $\gamma$   AAE 
0.8        80         100        1.94° 
0.4        80         100        2.10° 
1.6        80         100        2.04° 
0.8        40         100        2.67° 
0.8        160        100        2.21° 
0.8        80         50         2.07° 
0.8        80         200        2.03° 


Table 4. Computation times and convergence for Yosemite sequence with clouds. 

3D - spatio-temporal method 

reduction       outer fixed    inner fixed    SOR      computation    AAE 
factor $\eta$   point iter.    point iter.    iter.    time/frame 
0.95            77             5              10       23.4s          1.94° 
0.90            38             2              10       5.1s           2.09° 
0.80            18             2              10       2.7s           2.56° 
0.75            14             1              10       1.2s           3.44° 




Fig. 2. (a) Left: Computed flow field between frame 5 and 6 of the Ettlinger Tor traffic sequence. 
(b) Right: Computed magnitude of the optical flow field. 



512 × 512. It is available at http://i21www.ira.uka.de/image_sequences/. In Fig. 2 
the computed flow field and its magnitude are shown. Our estimation gives very realistic 
results, and the algorithm hardly suffers from the interlacing artifacts that are present in 
all frames. Moreover, the flow boundaries are rather sharp and can be used directly for 
segmentation purposes by applying a simple thresholding step. 
6 Conclusion 

In this paper we have investigated a continuous, rotationally invariant energy functional 
for optical flow computations based on two terms: a robust data term with a bright- 
ness constancy and a gradient constancy assumption, combined with a discontinuity- 
preserving spatio-temporal TV regulariser. While each of these concepts has proved 
its use before (see e.g. [22,26]), we have shown that their combination outperforms all 
methods from the literature so far. One of the main reasons for this performance is the use 
of an energy functional with non-linearised data term and our strategy to consistently 
postpone all linearisations to the numerical scheme: While linearisations in the model 
immediately compromise the overall performance of the system, linearisations in the 
numerical scheme can help to improve convergence to the global minimum. Another im- 
portant result in our paper is the proof that the widely-used warping can be theoretically 
justified as a numerical approximation strategy that does not influence the continuous 
model. We hope that this strategy of transparent continuous modelling in conjunction 
with consistent numerical approximations shows that excellent performance and deeper 
theoretical understanding are not contradictory: They are nothing else but two sides of 
the same coin. 
Abstract. Classical techniques for the reconstruction of axisymmetrical 
objects all create artefacts (overly smooth or unstable solutions). 
Moreover, the extraction of very precise features related to large density 
transitions remains quite delicate. In this paper, we develop a new 
approach (in one dimension for the moment) that allows us both to 
reconstruct and to extract characteristics: an a priori is provided by 
a density model. We show the value of this method with regard to the 
quantification of noise effects; we also explain how to take into account 
some physical perturbations occurring during real data acquisition. 

Keywords: tomography, flexible models, regularization, deblurring. 



1 Introduction 



Over the last ten years, teams of researchers have worked on the tomographic 
reconstruction of objects from a very small number of views, the final goal being 
to delimit very precisely large density transitions between the various materials 
[15,8,20,19,23,9] (typically in angiography) and also to recover good values 
of the density field when the objects are not binary. 

The general context of our study is the reconstruction, from a single X-ray 
photograph, of an object with a symmetry of revolution; here, we assume that 
X-rays are parallel (because the objects are sufficiently far from the emitter) 
and monoenergetic. This work is part of a hydrodynamic high yield test project 
where we study the dynamic behaviour of objects constrained by shock waves 
produced with explosives. Due to the very hostile experimental environment, 
there is only a single X-ray machine. In order to interpret the signals received on 
the detectors, we have to find, from the unique projection, the interfaces between 
the different areas of the object in order to label the materials a posteriori. 
Moreover, it is fundamental for us to estimate precisely their respective masses: 
this operation requires a very good knowledge of the density field $\rho : \mathbb{R}^3 \to \mathbb{R}$. 
The data we get in our experiments are formed in the following way. 
The attenuations of the X-ray beam are given by 

$$att(x, z) = \exp\left( -\int \mu(x, y, z)\, dy \right)$$

where $\mu(x, y, z)$ is the attenuation coefficient at the point $(x, y, z)$. As the source 
illumination is monoenergetic, we can define a reference attenuation coefficient 
$(\mu/\rho)_{ref}$, constant everywhere in the spatial domain. This allows us to write: 

$$att(x, z) = \exp\left( -\left( \frac{\mu}{\rho} \right)_{ref} \int \rho_{ref}(x, y, z)\, dy \right)$$

where $\rho_{ref}$ is the equivalent density of the reference material. The quantity 

$$\mathcal{Y}(x, z) = \int \rho_{ref}(x, y, z)\, dy \qquad (1)$$

is defined as the projection of the equivalent object. 



Remark 1. If $\rho_{ref}$ is known at each point $(x, y, z)$ and if the materials are 
labelled (very often thanks to expert analysis), then $(\mu/\rho)(x, y, z)$ is known and the 
whole density field can be obtained using the following conversion: 

$$\rho(x, y, z) = \rho_{ref}(x, y, z) \times \frac{(\mu/\rho)_{ref}}{(\mu/\rho)(x, y, z)} \qquad (2)$$



The data $\mathcal{Y}$ of the reconstruction process are biased by the systems of 
production and acquisition of X-ray photons. The two main perturbations are 
the additive noise on the projections and the presence of blur due to the 
X-ray source and the detector (see [18] for more details). 

Under these hypotheses, tomographic reconstruction of axisymmetrical objects 
from a single projection is technically achievable [1] (thanks to axisymmetry) 
but it remains very delicate: generally, this leads to an inverse problem 
which is well known to be ill-posed in the sense of Hadamard [13] because the 
sensitivity of the solution to noise is very high. 



Historically, in this context, Abel proposed in 1826 [1] a method based on 
the inversion of his transform [3]. This approach has been improved more recently 
[5] [14] [11] so as to decrease the artefacts generated by noise on the projections. 
However, the results are still too unstable. 
Some authors [14] [10] [16] also proposed to adapt classical techniques used in 
“conventional” tomography (Fourier synthesis and filtered backprojection): the 
idea is to duplicate the unique projection to simulate acquisition from a large 
number of angles. All these reconstructions share a loss of resolution and 
correlated noise, which make segmentation difficult. 

Thanks to an optimal meshing technique described in [7], it is also possible 
to get, for each plane section of the object, a reconstruction by Generalized 
Inversion based on a natural sampling in tori, 
with $Y(x) = \mathcal{Y}(x, \cdot)$ and $X(r) = X(\sqrt{x^2 + y^2}) = \rho_{ref}(x, y, \cdot)$. On each section, 
we have a relation between Y and X given by Y = HX, where H is the projection 
matrix, which is upper triangular and well conditioned. The solution is 
then simple and easy to compute as it consists of matrix inversion and multiplication, 
but it is very unstable: the noise is amplified, mainly near the axis of 
symmetry [7]. 
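
The relation Y = HX can be illustrated with a minimal sketch. It assumes a simple discretisation in concentric annuli of unit width with one ray per annulus (ray i at abscissa x = i); this is an illustrative choice and not necessarily the optimal meshing of [7]. All function names are hypothetical.

```python
import math


def projection_matrix(n):
    """H[i][j] = chord length of ray x = i inside the annulus [j, j + 1]."""
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # annuli closer to the axis than the ray are never crossed
            outer = math.sqrt((j + 1) ** 2 - i ** 2)
            inner = math.sqrt(j ** 2 - i ** 2)
            H[i][j] = 2.0 * (outer - inner)
    return H


def project(H, X):
    """Forward model Y = H X."""
    return [sum(h * x for h, x in zip(row, X)) for row in H]


def invert_upper_triangular(H, Y):
    """Back-substitution, starting from the outermost annulus."""
    n = len(Y)
    X = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(H[i][j] * X[j] for j in range(i + 1, n))
        X[i] = (Y[i] - tail) / H[i][i]
    return X


# Hypothetical radial density profile, from the axis outwards.
X_true = [3.0, 3.0, 1.0, 1.0, 0.5]
H = projection_matrix(len(X_true))
X_rec = invert_upper_triangular(H, project(H, X_true))
```

In this discretisation the diagonal entry is $2\sqrt{2i + 1}$, which is smallest at the axis, and the errors of the outer annuli accumulate inwards during back-substitution; both effects are consistent with the amplification of noise near the axis of symmetry.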

The poor quality of the estimated density field led to the introduction of 
regularization processes. The very simple Tikhonov-based approaches [24] are not 
efficient enough here because the solution is too smooth. Jean Marc Dinten [7] 
used Markov random fields (in the definition of the a priori energy in a MAP 
criterion), which decrease the influence of noise while preserving high density 
transitions. His method is indeed efficient but there remain a lot of parameters 
whose tuning is not straightforward. 

The common characteristic of all the previous approaches is that they provide 
an equivalent density field $\rho_{ref}$ which is not segmented into materials. So they 
require a supplementary labelling process, obtained after contour extraction 
and expert analysis, in order to correct the density thanks to equation 2. The 
consequence is that additional uncertainties, inherent to the contour extractor, 
are added to the final field $\rho$. 

Moreover, the blur present on attenuations (see section 4) is not taken into 
account (direct deblurring being not satisfactory) during the reconstruction pro- 
cess. The main effect, as shown in section 4, is to modify the estimated masses 
for all the materials. 

In this paper, we propose a new approach where we introduce an a priori 
on the shape of the objects: an axisymmetrical density model. First, we treat 
a 1D technique where each plane section is processed independently. 
In our high yield hydrodynamic experiments, the shock wave propagation and 
multiple reflection phenomena generate areas with approximately linearly varying 
[Figure: densities (a.u.) plotted against radii (pixels), with the interface radii r1 to r4 marked] 

Fig. 1. Example of 1D density model 



densities (which is confirmed by the physicists' expert analysis), so a realistic 1D 
model, illustrated in figure 1, is built by juxtaposition of constant density areas 
and of linearly varying density areas. 
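
As a concrete illustration of such a model, the sketch below computes the projection of a section made of concentric constant density areas only; the linearly varying areas are deliberately left out to keep the example short, so this is a simplification of the model of figure 1. The function names and the argument layout (radii listed from the outside inwards) are assumptions made for the example.

```python
import math


def half_chord(radius, x):
    """Half length of the chord cut by the ray at abscissa x in a disc."""
    return math.sqrt(max(radius * radius - x * x, 0.0))


def project_layers(radii, densities, x):
    """Areal mass seen by the ray at abscissa x.

    radii:     decreasing interface radii r_1 > r_2 > ... (outside inwards)
    densities: densities[i] fills the shell between radii[i] and radii[i + 1]
               (the last density fills the central disc)
    """
    total = 0.0
    for i, density in enumerate(densities):
        r_out = radii[i]
        r_in = radii[i + 1] if i + 1 < len(radii) else 0.0
        total += 2.0 * density * (half_chord(r_out, x) - half_chord(r_in, x))
    return total
```

For a single disc of radius r and density D this reduces to $2D\sqrt{r^2 - x^2}$, whose derivative with respect to r is unbounded at $|x| = r$.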

In section 2, we detail this model-fitting approach. In section 3, we 
present a sensitivity study of the parameters of the deformable model in the case 
where the data are noisy. This is compared with the uncertainties obtained when 
we use the results of generalized inversion (which will be our reference method in 
this paper). In section 4, an original way to achieve deblurring/reconstruction 
from blurred data is presented. 



2 1D Reconstruction 

We have previously presented “classical techniques” for the reconstruction of a 
plane section of an object in equivalent densities. We have also mentioned the 
need to label the materials so as to correct their density. 

We propose here a new approach that allows us both to reconstruct and to 
extract the sought characteristics of the objects (interface radii $r_i$ and 
densities $d_j$ as illustrated in figure 1) thanks to the introduction of a strong 
a priori on the density. If we denote by $\omega \in \mathbb{R}^n$ (where n is the number of parameters 
of the 1D model) the vector of radii $r_i$ and of densities $d_j$, by x the pixels’ 
abscissa, by Y(x) the data (areal masses) and by $proj_\omega(x)$ the projection model, the 
problem of reconstruction can be stated as follows: 

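$$(\mathcal{P}) : \quad \hat{\omega} = \arg\min_{\omega \in \Omega} \left\| proj_\omega - Y \right\|^2 = \arg\min_{\omega \in \Omega} (\mathcal{E}_1), \qquad (C) : g(\omega) > 0, \qquad (3)$$

$$\Omega = \{ \omega \in \mathbb{R}^n \;/\; \omega_l < \omega < \omega_u \}$$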

where $\hat{\omega}$ is the solution of $(\mathcal{P})$ that minimizes $\mathcal{E}_1$. The constraints (C) are used to 
limit the domain and to ensure the existence of all the areas during the process 
(that is to say $r_i > r_{i+1}, \forall i$). We can notice here that the criterion is defined 
continuously on $\Omega$ and performs a sub-pixel reconstruction. 

The analysis of minimization methods (simulated annealing [12], I.C.M. [2] 
and gradient descents [4] [22] [21]) leads us to choose gradient descents under 
inequality constraints because they are faster and easier to compute with constraints 
(Lagrange multiplier theory); they also preserve the continuous aspect 
of the criterion. 

The main problem is that $proj_\omega$, and consequently the criterion $\mathcal{E}_1$ of (3), are $C^1$ 
on $\Omega$ except at a finite number of points: we can show that $\mathcal{E}_1$ is infinitely 
differentiable with respect to the $d_i$ and differentiable with respect to the $r_i$ everywhere 
except at $r_i = |x|$, where x are the discrete positions of the data. So as 
to get class $C^1$ on $\Omega$, we have proposed two kinds of regularizations. 

Remark 2. We will denote by $A \underset{\omega}{\star} B$ the convolution of A and B in $\Omega$ and by $A \underset{x}{\star} B$ 
the result of a spatial convolution. 



2.1 Regularization by Convolution 

The main idea is to find a function $h : \mathbb{R}^n \to \mathbb{R}$, of class $C^1$ on $\Omega$, such that 
$proj_\omega \underset{\omega}{\star} h$ is $C^1$ on $\Omega$. So, the new criterion defined by 

$$\mathcal{E}_2 = \left\| proj_\omega \underset{\omega}{\star} h - Y \right\|^2 \qquad (4)$$

will have the desired property. An analysis of $proj_\omega \underset{\omega}{\star} h$ provides us with a simple 
expression for h: 

$$h(\omega) = \prod_{i=1}^{n_r} h_{1D}(r_i) \qquad (5)$$

where $n_r$ is the number of interface radii and $h_{1D}$ a kernel defined on $\mathbb{R}$. 
The kernel $h_{1D}$ can then be expressed in the following way: 

$$h_{1D}(r) = f\left( \frac{x - r}{\beta} \right) \quad \text{if } x \in [r - \beta, r + \beta] \qquad (6)$$

($\beta$ is the regularization parameter), where f is a Gaussian-like function whose 
support is $[-1, 1]$. 

This technique proved to be efficient as we have obtained the convergence 
of the process of minimization of the energy given by equation 4 (for $\beta > 1$ 
numerically). But, as expected, the final solution sometimes depends severely on 
the choice of the regularization parameter $\beta$. 



2.2 Regularization with a Weighting Function 

In the previous subsection, we provided a way to solve our minimization problem. 
Unfortunately, we found that the final estimate of $\omega$ was unacceptably dependent 
on the regularization parameter $\beta$. Here, we propose a new way to regularize 
that is simpler and quite “transparent” (i.e. independent of regularization 
parameters). 

Let $u : \mathbb{R} \times \Omega \to \mathbb{R}$ be a function of class $C^p$, $p > 1$, that equals zero in a 
neighbourhood of all the singular points; then the criterion 

$$\mathcal{E}_3 = \sum_{x \in \mathcal{D}} \left| u(x, \omega) \left( proj_\omega(x) - Y(x) \right) \right|^2, \qquad (7)$$

where $\mathcal{D}$ is the set of measure points, is of class $C^p$ on $\Omega$. 

The function u can be chosen as follows: 

$$u(x, \omega) = \prod_{i=1}^{n_r} w(x, r_i) \qquad (8)$$

where $w(x, 0)$ is an even function, equal to 0 in $[0, \varepsilon]$ and to 1 in $[k\varepsilon, +\infty[$. Its graph 
(and those of its first and second derivatives with respect to r) is given by: 

This approach, much faster than the previous one, allowed us to solve our 
minimization problem. Moreover, it appears that the final solution is quite independent 
of the choice of $\varepsilon$ and k, which we fix at 1 and 3 respectively. 
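
A possible weighting function of this kind is sketched below. The text only specifies that the elementary weight is even, equal to 0 in $[0, \varepsilon]$ and to 1 beyond $k\varepsilon$; the quintic polynomial used for the transition (which gives continuous first and second derivatives) and the assumption that each factor depends on the distance $|x| - r_i$ to the singular point are illustrative choices, not necessarily the ones used here.

```python
def weight_1d(t, eps=1.0, k=3.0):
    """Even weight: 0 for |t| <= eps, 1 for |t| >= k * eps, C^2 in between."""
    s = (abs(t) - eps) / ((k - 1.0) * eps)
    s = min(1.0, max(0.0, s))
    # Quintic "smootherstep": value, first and second derivatives
    # match the constant pieces at s = 0 and s = 1.
    return s * s * s * (10.0 - 15.0 * s + 6.0 * s * s)


def weight(x, radii, eps=1.0, k=3.0):
    """Product of elementary weights, one per interface radius."""
    result = 1.0
    for r in radii:
        # Assumed form: the singular point of the criterion is |x| = r.
        result *= weight_1d(abs(x) - r, eps, k)
    return result
```

With this choice, the measure points closer than $\varepsilon$ to an interface are simply ignored by the criterion, and those further than $k\varepsilon$ from every interface keep their full weight.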

However, before reconstructing the object by 1D fitting, we must calculate 
the optimal number of linearly varying density areas. If we denote by $\hat{\sigma}^2$ an estimate of 
the noise variance, we demonstrated that this number is correct if the optimal 
value $\mathcal{E}_3(\hat{\omega})$ of the criterion (7) is close enough to the level implied by $\hat{\sigma}^2$ and if the sensitivity (to the noise 
present on the projections) of the parameter $\hat{\omega}$ (whose expression is given in the 
next section) is small. For our objects, a model with eight parameters and two 
linearly varying areas (see figure 1) is always optimal. 



3 Sensitivity to Noise 

Getting a good precision on the position of the interfaces is very important in our 
context. If the function u defined in 2.2 is built to be of class $C^2$ on $\mathbb{R} \times \mathbb{R}^n$, then 
the criterion given by equation 7, denoted $\mathcal{E}_3$ below, is $C^2$ on $\Omega$. The zero-crossing condition of the 
gradient of $\mathcal{E}_3$ in an acceptable open set leads to an implicit system $F(\hat{\omega}, Y) = 0$ 
where F is continuously differentiable and has an invertible Jacobian matrix. Under 
these conditions, the implicit function theorem [6] guarantees the existence 
of a function G (such that $\hat{\omega} = G(Y)$) that is differentiable with respect to Y(x) 
and whose derivative is: 

$$G'(Y) = -\left( \frac{\partial^2 \mathcal{E}_3}{\partial \omega_i\, \partial \omega_j} \right)^{-1}_{(i,j) \in I^2} \left( \frac{\partial^2 \mathcal{E}_3}{\partial \omega_i\, \partial Y_j} \right)_{(i,j) \in I \times J} \qquad (9)$$

where I is the set $\{1, \dots, n\}$, J the set $\{1, \dots, N\}$, n the number of model parameters 
and N the number of data Y(x). 

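In our case, we can assume that the additive noise b on Y is Gaussian, zero 
mean, spatially uncorrelated and stationary ($b \sim \mathcal{N}(0, \Sigma_b)$, $\Sigma_b = \sigma^2 \times I_N$), so the 
differential expression $d\hat{\omega} = G'(Y)\, dY$ allows us to compute the covariance matrix 
of $\hat{\omega}$: 

$$\Sigma_{\hat{\omega}} = G'(Y) \times \Sigma_b \times G'(Y)^T \qquad (10)$$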
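
The propagation of the noise covariance can be checked on a deliberately small case. The sketch below assumes a model with a single free parameter, the density D of a disc of known radius r, an unweighted least-squares criterion and white noise of standard deviation sigma; with one parameter the matrices of the sensitivity formula reduce to a scalar and a row vector. The function names and the sample positions are hypothetical.

```python
import math


def density_sensitivity(r, xs):
    """Row vector G'(Y) for the criterion E(D) = sum_j (D * a_j - Y_j)^2."""
    a = [2.0 * math.sqrt(max(r * r - x * x, 0.0)) for x in xs]
    second_derivative = 2.0 * sum(aj * aj for aj in a)   # d2E / dD2
    cross_derivatives = [-2.0 * aj for aj in a]          # d2E / (dD dY_j)
    # Implicit function theorem: G' = -(d2E/dD2)^(-1) * d2E/(dD dY).
    return [-c / second_derivative for c in cross_derivatives]


def density_variance(r, xs, sigma):
    """G' Sigma_b G'^T with Sigma_b = sigma^2 * identity."""
    g = density_sensitivity(r, xs)
    return sigma ** 2 * sum(gj * gj for gj in g)


xs = list(range(-10, 11))
variance = density_variance(10.0, xs, sigma=8.0)
```

For this linear case the result coincides with the classical least-squares variance $\sigma^2 / \sum_j a_j^2$, which gives a simple sanity check of the general expression.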

So as to compare the precision on interfaces obtained with the present model-based 
approach and the classical approach (generalized inversion followed by 
contour extraction), we have also established the statistical law of the interface positions for the latter. 
This work is developed in an internal document whose main results are given 
here. To illustrate these results we generate our data by projection of the model 
given in figure 1. 

So as to compare the two reconstructions (from model-based and classical 
approaches) when the projections are noisy, we add a realistic Gaussian noise 
of standard deviation 8, as presented in figure 2. The comparison of the reconstructions 
with fitting and generalized inversion is then illustrated in figure 3. 




Fig. 2. Noisy projection of the model 



The strong instability of the reconstruction obtained by generalized inversion 
(dotted lines) appears clearly whereas the model obtained by fitting (continuous 
line) is very similar to figure 1. For fitting, the parameter standard deviations 
(calculated with formula 10) are very low. We deduce absolute errors of less than 
2% for the densities $d_i$, and the variation of the interface positions does not exceed 
Fig. 3. Comparison of the reconstructions (generalized inversion in dotted lines and our 
approach in continuous lines) 



a half pixel. These results confirm the very good stability of reconstruction by 
fitting. For generalized inversion, the error on density is between 10 and 90% and 
the standard deviation of the interface position error is between 1 and 2 pixels. 

We can conclude that our approach undoubtedly provides a very important 
increase in precision on the characteristic parameters that we are 
looking for. 



4 1D Deblurring of 2D Blur 



Blur is mainly due to the fact that the X-ray transmitter is not a pinpoint source 
of light; moreover, the detector acts as a low-pass filter. In the current section, 
we suppose that the blur kernel associated with those perturbations is circularly 
symmetric with a known shape (from a specific experiment). The origin of this 
perturbation is in the energy domain of X photons (i.e. attenuations of X photons 
going through the object). So, the blurred projection $\mathcal{Y}_{blur}$ is a function of the 
ideal projection $\mathcal{Y}$ of the object and is defined by: 

$$\mathcal{Y}_{blur}(x, z) = -\frac{1}{(\mu/\rho)_{ref}} \ln\left( e^{-(\mu/\rho)_{ref}\, \mathcal{Y}} \underset{x,z}{\star} H^{2D} \right)(x, z) \qquad (11)$$



This expression allows us to state a very important result: the total mass of the 
blurred object ($M_{blur}$) is different from its real mass (M). This is due to the fact 
that: 

$$M_{blur} = \int \mathcal{Y}_{blur} \neq M = \int \mathcal{Y} \qquad (12)$$

and therefore M cannot be deduced directly from the data $\mathcal{Y}_{blur}$. 
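
This mass bias can be observed on a small 1D example. The sketch below assumes that the blur acts on the attenuations, i.e. that the blurred projection is minus the logarithm of the blurred attenuation divided by the reference coefficient; the disc profile, the box kernel and the value `mu_ref = 1` are hypothetical test values.

```python
import math


def blur_in_attenuation_domain(y, kernel, mu_ref=1.0):
    """Blur exp(-mu_ref * y) with a symmetric kernel, then go back to a projection."""
    attenuation = [math.exp(-mu_ref * value) for value in y]
    half = len(kernel) // 2
    blurred = []
    for i in range(len(attenuation)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            # Outside the sampled window there is no object: attenuation is 1.
            acc += weight * (attenuation[j] if 0 <= j < len(attenuation) else 1.0)
        blurred.append(-math.log(acc) / mu_ref)
    return blurred


# Projection of a disc of radius 20 and density 0.05 (hypothetical values).
xs = range(-30, 31)
y = [0.05 * 2.0 * math.sqrt(max(20.0 ** 2 - x * x, 0.0)) for x in xs]
kernel = [0.2] * 5  # normalised box blur

y_blur = blur_in_attenuation_domain(y, kernel)
mass_true = sum(y)
mass_blur = sum(y_blur)
```

In this example the blurred mass is smaller than the true one. This follows from the convexity of the exponential: blurring the attenuations and then taking the logarithm gives, at each point, a value no larger than the blurred projection itself, whose sum equals the true mass for a normalised kernel.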

Mass retrieval for each material constituting the object is one of the most 
important goals of our study. So the need to deblur the projections $\mathcal{Y}_{blur}$ is 
evident. Classical operations like Wiener filtering, FIR filtering, ... [17] do not provide 
satisfactory results in our context because Y exhibits very high frequencies, 
additive noise is quite white and blur kernels are quite narrow. 

In this section, we first deal with the general problem of deblurring/ 
reconstruction in one dimension from a projection $Y_{blur}$ blurred with a kernel H 
(formula 11 written in 1D). Afterwards, we develop the case of two kinds of 3D 
objects for which this process is achievable. 



4.1 The Problem in One Dimension 



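So as to introduce a deblurring operation during reconstruction by fitting, we 
define the criterion in the following way: 

$$\mathcal{E} = \sum_{x \in \mathcal{D}} \left| \frac{1}{(\mu/\rho)_{ref}} \ln\left( e^{-(\mu/\rho)_{ref}\, proj_\omega} \underset{x}{\star} H(x) \right) + Y_{blur}(x) \right|^2 \qquad (13)$$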



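In order to use a gradient descent to compute the solution of our problem, we 
first need to verify the differentiability of the criterion (13), denoted $\mathcal{E}$, with respect to $\omega$, and so to analyse 
its partial derivatives with respect to the $\omega_i$: 

$$\frac{\partial \mathcal{E}}{\partial \omega_i} = -2 \times \sum_{x \in \mathcal{D}} \left[ \frac{1}{(\mu/\rho)_{ref}} \ln\left( e^{-(\mu/\rho)_{ref}\, proj_\omega} \underset{x}{\star} H(x) \right) + Y_{blur}(x) \right] \times \frac{\left( \dfrac{\partial proj_\omega}{\partial \omega_i}\, e^{-(\mu/\rho)_{ref}\, proj_\omega} \right) \underset{x}{\star} H(x)}{e^{-(\mu/\rho)_{ref}\, proj_\omega} \underset{x}{\star} H(x)} \qquad (14)$$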



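If we denote by $att_{blur}$ the blurred attenuation of the object given by: 

$$att_{blur}(x) = \int_{\mathbb{R}} e^{-(\mu/\rho)_{ref}\, proj_\omega(\tau)}\, H(x - \tau)\, d\tau \qquad (15)$$

then the only problematic term in the computation of the gradient is: 

$$\frac{\partial att_{blur}}{\partial \omega_i}(x) = -\left( \frac{\mu}{\rho} \right)_{ref} \left( \frac{\partial proj_\omega}{\partial \omega_i}\, e^{-(\mu/\rho)_{ref}\, proj_\omega} \right) \underset{x}{\star} H(x) \qquad (16)$$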



because the derivatives of the projection do not exist for all the values of the 
parameter $\omega$. We have shown that, in fact, the main difficulty is generically 
reduced to the case of a model with a single constant density area whose parameters 
are called r and D, for which the expression of the previous equation turns out 
to be: 

$$\frac{\partial att_{blur}}{\partial r}(x) = 2 D r \int_{-r}^{r} \left( -\left( \frac{\mu}{\rho} \right)_{ref} \frac{e^{-2 (\mu/\rho)_{ref} D \sqrt{r^2 - \tau^2}}}{\sqrt{r^2 - \tau^2}} \right) \times H(x - \tau)\, d\tau = 2 D r \int_{-r}^{r} K(r, \tau, x)\, d\tau \qquad (17)$$

The function $K(r, \cdot, x)$ can be integrated on $[-r, r]$, so this expression shows that 
the criterion is differentiable if the convolution integral is performed on a continuous 
domain. On sampled data, however, the integrand is unbounded at $\tau = \pm r$; in conclusion, 
the criterion is numerically not differentiable with respect to the $r_i$. 
In the two following items, we demonstrate that, from the definition of a 
non-differentiable criterion, we can supply very good approximations of its “true” 
gradient (i.e. calculated continuously as in 17) and so ensure the convergence of 
the minimization scheme to the exact solution. 



Explicit Computation of the Gradient. If we denote by $\tau_0$ a positive integer 
lower than r and belonging to the set $\mathcal{D}$ (the set of measure points x defined 
in 2.2), then a reformulation of equation 17 leads to: 

$$\frac{1}{2 D r} \frac{\partial att_{blur}}{\partial r}(x) = \underbrace{\int_{-\tau_0}^{\tau_0} K(r, \tau, x)\, d\tau}_{\text{computation by Discrete Fourier Transform}} + \underbrace{\int_{\tau_0}^{r} K(r, \tau, x)\, d\tau}_{\text{first remainder } R_1(x)} + \underbrace{\int_{-r}^{-\tau_0} K(r, \tau, x)\, d\tau}_{R_2(x) = R_1(-x)}$$

The first term is easily computable and the only difficult issue is the remainder $R_1(x)$. 
Thanks to an integration of $R_1(x)$ by parts, we finally get a numerically convergent 
integral and then the sought approximation of the gradient. 

The main drawback of this method is that we must have a formal expression 
of the blur kernel H , which is not the case in general. 



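Computation in the Fourier Domain. Let us recall the main problem in 
equation 16: the generic expression $\frac{\partial proj_\omega}{\partial r}(x)$ does not exist for all x. But its 
Fourier Transform is defined everywhere and is given by its cosine transform: 

$$FT\left[ \frac{\partial proj_\omega}{\partial r} \right](f) \propto \int_{-r}^{r} \frac{\cos(2\pi x f)}{\sqrt{r^2 - x^2}}\, dx = \int_0^{\pi} \cos(2\pi r f \cos\theta)\, d\theta = \pi \times J_0(2\pi r f)$$

where $J_0$ is the Bessel function of the first kind of order zero. 
We can now write the computation of the blurred attenuation (eq. 17) if we adopt 
the following process: 

$$\begin{array}{ccc}
\dfrac{\partial att_{blur}}{\partial r}(x) & = & -\left( \dfrac{\mu}{\rho} \right)_{ref} \left( \dfrac{\partial proj_\omega}{\partial r} \times e^{-(\mu/\rho)_{ref}\, proj_\omega} \right) \underset{x}{\star} H(x) \\
\downarrow FT & & \downarrow FT, \; DFT, \; DFT \\
FT\left[ \dfrac{\partial att_{blur}}{\partial r} \right](f) & = & -\left( \dfrac{\mu}{\rho} \right)_{ref} \left( FT\left[ \dfrac{\partial proj_\omega}{\partial r} \right] \underset{f}{\star} DFT\left[ e^{-(\mu/\rho)_{ref}\, proj_\omega} \right] \right)(f) \times DFT[H](f)
\end{array} \qquad (18)$$

where the convolution, in the Fourier domain, between the transforms of $\frac{\partial proj_\omega}{\partial r}$ and of 
$e^{-(\mu/\rho)_{ref}\, proj_\omega}$ uses a sampling of the former. The derivative $\frac{\partial att_{blur}}{\partial r}$ is then given by the inverse DFT of this product, which 
finally allows us to provide an approximation of the gradient of the criterion. 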

If we compare this technique with the one presented previously, we can notice 
that we do not have to know the blur kernel H continuously. The only constraints 
come from the sampling of the Fourier Transform of $\frac{\partial proj_\omega}{\partial r}$. It indeed vanishes 
very slowly, so the cancellation of high frequencies generates small artefacts. 
However, these perturbations remain low enough not to disturb the minimization 
process. This approach is moreover the fastest one. 
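
The closed form of this cosine transform can be verified numerically. The sketch below compares a midpoint-rule evaluation of $\int_0^{\pi} \cos(2\pi r f \cos\theta)\, d\theta$ with $\pi J_0(2\pi r f)$, where $J_0$ is computed from its power series; the values of r and f, the number of integration steps and the number of series terms are arbitrary test choices.

```python
import math


def bessel_j0(z, terms=60):
    """Power series J0(z) = sum_m (-1)^m (z / 2)^(2m) / (m!)^2 (moderate |z|)."""
    total, term = 0.0, 1.0
    for m in range(terms):
        total += term
        # Ratio between consecutive terms of the series.
        term *= -(z * z / 4.0) / ((m + 1) ** 2)
    return total


def cosine_integral(r, f, steps=2000):
    """Midpoint rule for the integral of cos(2 pi r f cos(theta)) on [0, pi]."""
    dtheta = math.pi / steps
    return dtheta * sum(
        math.cos(2.0 * math.pi * r * f * math.cos((k + 0.5) * dtheta))
        for k in range(steps)
    )


r, f = 2.0, 0.3
numeric = cosine_integral(r, f)
closed_form = math.pi * bessel_j0(2.0 * math.pi * r * f)
```

Since $J_0$ only decays like the inverse square root of its argument, the transform vanishes slowly with the frequency, as stated above.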



In the following two subsections, we deal with two kinds of 3D objects for 
which an extension of 1D deblurring/reconstruction by fitting is possible and 
moreover, once again, exact. 
4.2 Application to “3D Cylindrical Objects” 

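For this kind of object, the projections $\mathcal{Y}$ are independent of z, so we can identify 
the 1D projection Y(x) with $\mathcal{Y}(x, z)$, $\forall z$. So as to be able to use the previous results, 
we must find an expression relating the kernel $H^{2D}$ applied to $\mathcal{Y}$ to a 1D kernel 
denoted H (that will be convolved with Y) that verifies: 

$$\mathcal{Y}_{blur}(x, z) = \mathcal{Y} \underset{x,z}{\star} H^{2D}(x, z) = Y \underset{x}{\star} H(x), \quad \forall z \qquad (19)$$

This kernel H is known to be the Abel Transform [3] of $H^{2D}$ and is given by: 

$$H(x) = \int_{\mathbb{R}} \tilde{H}^{2D}\left( \sqrt{x^2 + y^2} \right) dy = AT[H^{2D}](x) \qquad (20)$$

with $\tilde{H}^{2D}(\sqrt{u^2 + v^2}) = H^{2D}(u, v)$, $\forall (u, v) \in \mathbb{R}^2$. 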
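
As a numerical illustration of this line integral, the sketch below computes the 1D kernel obtained from an isotropic 2D Gaussian kernel and compares it with the 1D Gaussian of the same standard deviation, which is the known closed form in that particular case. The Gaussian kernel, its width `sigma`, the integration range and the number of steps are assumptions made for the example; the blur kernel of a real acquisition system would be measured.

```python
import math


def gaussian_2d(u, v, sigma):
    """Normalised isotropic 2D Gaussian kernel (hypothetical blur kernel)."""
    return math.exp(-(u * u + v * v) / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)


def abel_transform_of_kernel(x, sigma, steps=4000):
    """Integrate the 2D kernel along y (midpoint rule on [-8 sigma, 8 sigma])."""
    limit = 8.0 * sigma
    dy = 2.0 * limit / steps
    return dy * sum(gaussian_2d(x, -limit + (k + 0.5) * dy, sigma)
                    for k in range(steps))


def gaussian_1d(x, sigma):
    """Normalised 1D Gaussian, the expected result for a Gaussian 2D kernel."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


h_numeric = abel_transform_of_kernel(0.7, sigma=1.5)
h_expected = gaussian_1d(0.7, sigma=1.5)
```

For a kernel known only through samples, the same line integral would be evaluated on the sampled grid instead of on an analytic expression.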

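With this new definition of the criterion to be minimized: 

$$\mathcal{E} = \sum_{x \in \mathcal{D}} \left| \frac{1}{(\mu/\rho)_{ref}} \ln\left( e^{-(\mu/\rho)_{ref}\, proj_\omega} \underset{x}{\star} AT[H^{2D}](x) \right) + \mathcal{Y}_{blur}(x, \cdot) \right|^2 \qquad (21)$$

the problem is then well posed. 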

The results we have obtained with this technique are striking because, if the 
blur kernel is known, the reconstruction by model fitting is then exact, whereas 
classical techniques provide a very smooth reconstruction, often far from the 
object. An example is illustrated in the next figure, where our exact reconstruction 
is drawn in continuous lines and the reconstruction obtained by generalized 
inversion in dotted lines. 




4.3 Application to 3D Spherical Objects 

In this case, the 2D data $\mathcal{Y}$ are the projections of a spherical axisymmetrical 
object. They are then circularly symmetric, centered at the point (c, c), and Y 
can be defined by $\mathcal{Y}(x, c)$. If we use here a property of the Hankel Transform [3] 
(denoted HT), we get: 

$$HT[\mathcal{Y}_{blur}(\cdot, c)](q) = HT\left[ \mathcal{Y} \underset{x,z}{\star} H^{2D} \right](q) = HT[Y](q) \times HT[\tilde{H}^{2D}](q) \qquad (22)$$
where $\tilde{H}^{2D}$ is defined in section 4.2. This expression allows us to identify H: 

$$HT[H](q) = HT[\tilde{H}^{2D}](q) \qquad (23)$$



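The problem is then well posed if we formulate the criterion as 

$$\mathcal{E} = \sum_{x \in \mathcal{D}} \left| \frac{1}{(\mu/\rho)_{ref}} \ln\left( HT^{-1}\left[ HT\left[ e^{-(\mu/\rho)_{ref}\, proj_\omega} \right] \times HT[\tilde{H}^{2D}] \right](x) \right) + \mathcal{Y}_{blur}(x, c) \right|^2$$

We demonstrate, thanks to the relations of section 4.3, that the reconstruction is indeed 
achievable. But the processing of direct and inverse Hankel Transforms remains 
a delicate problem and considerably increases computation time. 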



5 Conclusion 

In this paper, we have presented an original approach to the problem of tomographic 
reconstruction of an axisymmetrical object from one view. First, we have 
developed a 1D study where we deform a simple model of the object based on 
a description in density areas. We have described the formal aspects of the reconstruction 
and proposed two efficient regularizations that allow us to minimize the 
derived energy by gradient descent under inequality constraints. We have also 
studied the bias generated by the noise on projections; moreover, we have proposed 
a new formulation of the problem that enables us to deblur the projections 
during the reconstruction by fitting. In each case, we have compared our results 
to a reconstruction with generalized inversion; we have obtained an important 
improvement in precision on the characteristic parameters we are looking for. 

Our future work deals with the warping of a fully 3D axisymmetrical model of 
the objects. We are now working on the construction of smooth 3D density fields 
inserted between axisymmetrical surfaces under hypotheses of quasi-linearity of 
the density. 
Abstract. We present a novel variational approach to top-down image
segmentation, which accounts for significant projective transformations
between a single prior image and the image to be segmented. The proposed
segmentation process is coupled with reliable estimation of the
transformation parameters, without using point correspondences. The
prior shape is represented by a generalized cone that is based on the
contour of the reference object. Its unlevel sections correspond to possible
instances of the visible contour under perspective distortion and scaling.
We extend the Chan-Vese energy functional by adding a shape term.
This term measures the distance between the currently estimated section
of the generalized cone and the region bounded by the zero-crossing
of the evolving level set function. Promising segmentation results are
obtained for images of rotated, translated, corrupted and partly occluded
objects. The recovered transformation parameters are compatible with
the ground truth.



1 Introduction 

Classical methods for object segmentation and boundary determination rely on 
local image features such as gray level values or image gradients. However, when 
the image to segment is noisy or taken under poor illumination conditions, purely 
local algorithms are inadequate. Global features, such as contour length and 
piecewise smoothness [16], can be incorporated using a variational segmentation 
framework, see [1] and references therein. The handling of contours is facilitated 
by the level set approach [17]. In the presence of occlusion, shadows and low 
image contrast, prior knowledge on the shape of interest is necessary [20]. The 
recovered object boundary should then be compatible with the expected con- 
tour, in addition to being constrained by length, smoothness and fidelity to the 
observed image. 

The main difficulty in the integration of prior information into the variational 
segmentation process is the need to account for possible pose transformations 
between the known contour of the given object instance and the boundary in the 
image to be segmented. Many algorithms [4,6,5,14,19,13] use a comprehensive 
training set to account for small deformations. These methods employ various 
statistical approaches to characterize the probability distribution of the shapes. 
They then measure the similarity between the evolving object boundary (or level
set function) and representatives of the training data. The performance of these 
methods depends on the size and coverage of the training set. Furthermore, none 
of the existing methods accommodates perspective transformations in measur- 
ing the distance between the known instance of the object and the currently 
segmented image. 

We suggest a new method which employs a single prior image and accounts for 
significant projective transformations within a variational segmentation frame- 
work. This is made possible by two main novelties: the special form of the shape 
prior, and the integration of the projective transformations via unleveled sec- 
tions. These allow concurrent segmentation and explicit recovery of projective 
transformation in a reliable way. Neither point correspondence nor direct meth- 
ods [12] are used. The prior knowledge is represented by a generalized cone, 
which is constructed based on the known instance of the object contour. When 
the center of projection of a camera coincides with the vertex of the generalized 
cone, we are able to model the effects of the scene geometry. 

We use an extension of the Chan-Vese functional [3] to integrate image data 
constraints with geometric shape knowledge. The level set function and the pro- 
jective transformation parameters are estimated in alternation by minimization 
of the energy functional. The additional energy term that accounts for prior 
knowledge is a distance measure between a planar (not necessarily horizontal) 
section of the generalized cone and the zero-crossing of the evolving level set func- 
tion. Correct segmentation of partly occluded and corrupted images is demon- 
strated based on a prior image taken with different perspective distortion. The 
transformation parameters are recovered as well and are in good agreement with 
the ground truth. 



2 Unlevel-Sets 

2.1 Previous Framework 

Mumford and Shah [16] proposed to segment an input image $f: \Omega \to \mathbb{R}$ by
minimizing the functional

$$E(u,C) = \frac{1}{2}\int_{\Omega} (f-u)^2\,dx\,dy + \frac{\lambda}{2}\int_{\Omega - C} |\nabla u|^2\,dx\,dy + \nu|C| , \qquad (1)$$

simultaneously with respect to the segmenting boundary $C$ and the piecewise
smooth approximation $u$ of the input image $f$.

When the weight $\lambda$ of the smoothness term tends to infinity, $u$ becomes a
piecewise constant approximation, $u = \{u_i\}$, of $f$. We proceed with

$$E(u,C) = \frac{1}{2}\sum_i \int_{\Omega_i} (f-u_i)^2\,dx\,dy + \nu|C| , \qquad \bigcup_i \Omega_i = \Omega, \quad \Omega_i \cap \Omega_j = \emptyset . \qquad (2)$$

In the two phase case, Chan and Vese [3] used a level-set function
$\phi: \Omega \to \mathbb{R}$ to embed the contour $C = \{x \in \Omega \mid \phi(x) = 0\}$, and
introduced the Heaviside function $H(\phi)$ into the energy functional:

$$E_{CV}(\phi,u_+,u_-) = \int_{\Omega} \left[ (f-u_+)^2 H(\phi) + (f-u_-)^2 (1-H(\phi)) + \nu|\nabla H(\phi)| \right] dx\,dy \qquad (3)$$

where

$$H(\phi) = \begin{cases} 1 & \phi \geq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

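Using Euler-Lagrange equations for the functional (3), the following gradient
descent equation for the evolution of $\phi$ is obtained:

$$\frac{\partial\phi}{\partial t} = \delta(\phi)\left[ \nu\,\mathrm{div}\!\left(\frac{\nabla\phi}{|\nabla\phi|}\right) - (f-u_+)^2 + (f-u_-)^2 \right] . \qquad (5)$$

A smooth approximation of $H(\phi)$ (and $\delta(\phi)$) must be used in practice [3]. The
scalars $u_+$ and $u_-$ are updated in alternation with the level set evolution to take
the mean value of the input image $f$ in the regions $\phi > 0$ and $\phi < 0$, respectively:

$$u_+ = \frac{\int f(x,y)\,H(\phi)\,dx\,dy}{\int H(\phi)\,dx\,dy} , \qquad u_- = \frac{\int f(x,y)\,(1-H(\phi))\,dx\,dy}{\int (1-H(\phi))\,dx\,dy} . \qquad (6)$$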
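
The update cycle of equations (5) and (6) can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the arctan-based smooth Heaviside, the width `eps`, the time step `dt`, the central-difference curvature and all function names are assumptions introduced here.

```python
import numpy as np

def heaviside(phi, eps=1.0):
    """Smooth Heaviside (arctan form); eps is an assumed smoothing width."""
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

def delta(phi, eps=1.0):
    """Derivative of heaviside(): a smooth Dirac delta."""
    return (eps / np.pi) / (eps**2 + phi**2)

def region_means(f, phi, eps=1.0):
    """Equation (6): H-weighted means of f where phi > 0 and where phi < 0."""
    h = heaviside(phi, eps)
    u_plus = (f * h).sum() / h.sum()
    u_minus = (f * (1.0 - h)).sum() / (1.0 - h).sum()
    return u_plus, u_minus

def curvature(phi, tiny=1e-8):
    """div(grad phi / |grad phi|) by central differences."""
    gy, gx = np.gradient(phi)                  # derivatives along rows, columns
    norm = np.sqrt(gx**2 + gy**2) + tiny       # tiny avoids division by zero
    return np.gradient(gx / norm, axis=1) + np.gradient(gy / norm, axis=0)

def chan_vese_step(f, phi, nu=50.0, dt=0.1, eps=1.0):
    """One explicit gradient-descent step of equation (5)."""
    u_plus, u_minus = region_means(f, phi, eps)
    force = nu * curvature(phi) - (f - u_plus)**2 + (f - u_minus)**2
    return phi + dt * delta(phi, eps) * force

# Toy input: a bright square on a dark background, initial contour a circle.
yy, xx = np.mgrid[0:40, 0:40].astype(float)
f = np.zeros((40, 40))
f[10:30, 10:30] = 1.0
phi0 = 12.0 - np.hypot(xx - 20.0, yy - 20.0)   # positive inside the circle
u_plus, u_minus = region_means(f, phi0)
phi1 = chan_vese_step(f, phi0)
```

In practice the two updates are alternated until the zero-crossing of `phi` stops moving; the explicit scheme used here is chosen for brevity and needs a small `dt`.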



2.2 Shape Prior 

The energetic formulation (3) can be extended by adding a prior shape term [7]:

$$E(\phi,u_+,u_-) = E_{CV}(\phi,u_+,u_-) + \mu E_{shape}(\phi) , \qquad \mu > 0 . \qquad (7)$$

We present two novel contributions to this framework. One is a reformulation 
of the distance measure between the prior and the evolving level-set function, 
outlined, in a preliminary form, in the rest of this subsection and finalized in 
subsection 2.5. The other is our unique way of embedding the prior contour 
within the energy functional, motivated in subsections 2.3-2.4, and formulated
in subsection 2.5. 

Initially, the shape-term we incorporate in the energy functional measures
the non-overlapping areas between the prior shape and the evolving shape. Let
$\tilde\phi$ be the level set function embedding a prior shape contour. Then

$$E_{shape}(\phi) = \int_{\Omega} \left( H(\phi(x,y)) - H(\tilde\phi(x,y)) \right)^2 dx\,dy \qquad (8)$$

Note that we do not force the evolving level set function $\phi$ to resemble $\tilde\phi$;
instead we demand similarity of the regions within the respective contours.
Minimizing this functional with respect to $\phi$ leads to the following evolution equation:

$$\frac{\partial\phi}{\partial t} = \delta(\phi)\left[ \nu\,\mathrm{div}\!\left(\frac{\nabla\phi}{|\nabla\phi|}\right) - (f-u_+)^2 + (f-u_-)^2 - 2\mu\left( H(\phi) - H(\tilde\phi) \right) \right] \qquad (9)$$
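
The shape term (8) and its contribution to the evolution equation (9) can be sketched as follows. This is a minimal illustration, assuming a smooth arctan Heaviside with an assumed width `eps`; the function names and the two shifted disks are hypothetical test data, not taken from the paper.

```python
import numpy as np

def heaviside(phi, eps=1.0):
    """Smooth Heaviside (arctan form); eps is an assumed smoothing width."""
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

def shape_energy(phi, phi_prior, eps=1.0):
    """Equation (8): squared difference of the two region indicators, summed over pixels."""
    diff = heaviside(phi, eps) - heaviside(phi_prior, eps)
    return float((diff**2).sum())

def shape_force(phi, phi_prior, mu=25.0, eps=1.0):
    """The extra term -2*mu*(H(phi) - H(phi_prior)) inside the bracket of (9)."""
    return -2.0 * mu * (heaviside(phi, eps) - heaviside(phi_prior, eps))

# Two disks of radius 10 whose centers are 6 pixels apart.
yy, xx = np.mgrid[0:64, 0:64].astype(float)
phi = 10.0 - np.hypot(xx - 30.0, yy - 32.0)
phi_prior = 10.0 - np.hypot(xx - 36.0, yy - 32.0)

e_same = shape_energy(phi_prior, phi_prior)   # identical regions
e_shift = shape_energy(phi, phi_prior)        # roughly the non-overlapping area
force = shape_force(phi, phi_prior)
```

The force is positive where the prior region is not yet covered by the evolving region and negative where the evolving region spills outside the prior, so it pulls the two regions together without constraining the shape of `phi` elsewhere.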

This shape-term is adequate when the prior and segmented shapes are not
subject to different perspective distortions. Otherwise, the shape-term should
incorporate the projective transformation, as detailed in subsections 2.5-2.6.
However, a few key concepts should be introduced first.




Unlevel-Sets: Geometry and Prior-Based Segmentation 







Fig. 1. The cone of rays with vertex at the camera center. An image is obtained by
intersection of this cone with a plane. A ray between a 3D scene point P and the
camera center CC pierces the planes in the image points $p \in f$ and $p' \in f'$. All such
image points are related by planar homography, $p' = H_p p$. See [11].



2.3 Plane to Plane Projectivity 

An object in a 3D space and a camera center define a set of rays, and an image 
is obtained by intersecting these rays with a plane. Often this set is referred 
to as a cone of rays, even though it is not a cone in the classical sense. Now, 
suppose that this cone of rays is intersected by two planes, as shown in Fig. 1. 
Then, there exists a perspective transformation H mapping one image onto the 
other. This means that the images obtained by the same camera center may be 
mapped to one another by a plane projective transformation [8,11,9]. 

Let $f$ and $f'$ be the first and the second image planes, respectively. Let $K$
denote a 3 x 3 internal calibration matrix. Consider two corresponding points,
$p \in f$ and $p' \in f'$, expressed in homogeneous coordinates, which are two distinct
images of the 3D object point $P = (X, Y, Z)$, taken with the same camera. Their
relation can be described by $p' = KRK^{-1}p + \frac{1}{Z}Kt$, where $R$ is a 3 x 3 rotation
matrix and $t = [t_x, t_y, t_z]$ is a translation vector. Thus, for any given $K$, the
homography matrix $H_p$, such that $p' = H_p p$, can be recovered simply by estimating
the values of $R$ and $t$. Since only the plane transformation is important for the
segmentation process, when the camera internal parameters are not known, $K$ can
be set to the identity matrix, implying that the optical axis is normal to the
image plane $f$ and the focal length is 1.
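
The rotation part of this relation, $p' = KRK^{-1}p$, can be tried out numerically. The sketch below is an illustration under stated assumptions: it takes $K$ as the identity (as suggested in the text for unknown internal parameters), omits the translation term because that term depends on the depth of the scene point, and uses a rotation about the X axis with an assumed sign convention. All function names are introduced here for illustration.

```python
import numpy as np

def rotation_x(alpha):
    """Rotation about the X axis (sign convention assumed for this sketch)."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, s],
                     [0.0, -s, c]])

def rotation_homography(K, R):
    """Homography induced by a pure camera rotation: K R K^-1."""
    return K @ R @ np.linalg.inv(K)

def apply_homography(H, points):
    """Map (N, 2) image points through H using homogeneous coordinates."""
    points = np.asarray(points, dtype=float)
    homog = np.hstack([points, np.ones((len(points), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]    # divide by the third coordinate

K = np.eye(3)                                 # unknown calibration: identity
H = rotation_homography(K, rotation_x(0.1))   # 0.1 rad tilt about X
square = [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)]
warped = apply_homography(H, square)

# The two horizontal edges of the square no longer have the same length,
# which is the perspective distortion the segmentation must absorb.
width_y_neg = warped[1, 0] - warped[0, 0]
width_y_pos = warped[2, 0] - warped[3, 0]
```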



2.4 Generalized Cone 

A generalized cone¹, or a conical surface, is a ruled surface generated by a moving
line (the generator) that passes through a fixed point (the vertex) and
continually intersects a fixed planar curve (the directrix). Let $P_v = (X_v, Y_v, Z_{vertex})$
denote the cone vertex, and let $p_v = (x_v, y_v)$ be the projection of the vertex on
the directrix plane. We set, without loss of generality, $X_v = x_v$ and $Y_v = y_v$.

¹ The concept of generalized cone (or cylinder) in computer vision has been
introduced to model 3D objects [2,15]. Its geometrical properties have been
intensively investigated, see [10,18] and references therein.

Now, consider a directrix, $C = p(s) = (x(s), y(s))$, which is a closed contour,
parameterized by arc-length $s$, of an object shape in the plane $Z = Z_{plane} = 0$.
The generalized cone surface is the ruled surface defined by:

$$\tilde\phi(r,s) = \tilde\phi\left( (1-r)\,p(s) + r\,p_v \right) = (1-r)\,Z_{plane} + r\,Z_{vertex} \qquad (10)$$

where $r$ varies smoothly from 1, which corresponds to the vertex, via 0, the
directrix, to some convenient negative value.
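
Definition (10) can be turned into a height map on the pixel grid. The sketch below is a simplified illustration that assumes the directrix is star-shaped with respect to $p_v$, so that it can be described by a radius function of the polar angle; the paper does not state this restriction. The function name `generalized_cone` and its parameters are hypothetical.

```python
import numpy as np

def generalized_cone(radius_of_angle, shape, vertex_xy, z_vertex=1.0):
    """Height of the cone (10) over a pixel grid, with Z_plane = 0.

    radius_of_angle(theta) gives the distance from p_v to the directrix in
    direction theta (assumption: the directrix is star-shaped about p_v).
    """
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    dx, dy = xs - vertex_xy[0], ys - vertex_xy[1]
    dist = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx)
    # A pixel at distance dist lies at (1 - r) p(s) + r p_v with:
    r = 1.0 - dist / radius_of_angle(theta)
    return r * z_vertex            # (1 - r) * Z_plane + r * Z_vertex, Z_plane = 0

# Example: a circular directrix of radius 10 centered at pixel (20, 20).
cone = generalized_cone(lambda th: np.full_like(th, 10.0), (41, 41), (20.0, 20.0))
```

The map equals `z_vertex` at the vertex projection, vanishes on the directrix and is negative outside it, so the region `cone > 0` is the interior of the prior contour.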

When the vertex of the generalized cone is located at the camera center, 
the definition of the generalized cone coincides with that of the cone of rays, 
presented in subsection 2.3. It follows that by planar slicing of the generalized 
cone, one can generate new image views as though they had been taken with a 
camera under the perspective model. There is, however, one exception to this 
analogy. The intersection of a cone and a plane is either a closed curve, an open 
curve or a point. In projective geometry terminology, the latter two correspond to 
projection of finite points in the first image plane to infinity. We do not consider 
ideal points and planes at infinity. Phrasing it explicitly, our only concern is the 
mapping of a given closed curve to another closed curve. 

2.5 Reformulation of the Energy Functional 

The shape-term in the energy functional (7) is now extended to account for pro- 
jective transformations. The evolution of the level-set function, given the prior 
contour and an estimate of the pose parameters, is considered in this subsec- 
tion. The recovery of the pose parameters, given the prior contour and the curve 
generated by the zero-crossing of the estimated level-set function, is described 
in subsection 2.6. 

Following subsection 2.2, $\tilde\phi$ embeds the prior contour. For reasons that will
soon be explained, it is referred to as the unlevel-set function and will take the
form of a generalized cone. Let $C = \{(x,y) \mid \tilde\phi(x,y) = 0\}$ be the prior contour in $f$,
and let $T_p$ be a pose transformation applied to the unlevel-set function $\tilde\phi$:

$$(x', y', T_p(\tilde\phi))^T = R\,(x, y, \tilde\phi)^T + t . \qquad (11)$$

The evolving contour in the image to be segmented $f'$ is iteratively compared
with $C' = \{(x', y') \mid T_p(\tilde\phi) = 0\}$, which is the zero-crossing of the transformed
unlevel-set function. Note that instead of changing the pose of the intersecting
plane and keeping the generalized cone fixed, we rotate the generalized cone
around its vertex and translate it, while keeping the intersecting plane fixed.
Next, we apply the Heaviside function to the transformed unlevel-set function.
Thus, the shape-term of the energy functional (7) becomes

$$E_{shape}(\phi) = \int_{\Omega} \left( H(\phi) - H(T_p(\tilde\phi)) \right)^2 dx\,dy \qquad (12)$$

and the gradient descent equation, derived similarly to (9), is

$$\frac{\partial\phi}{\partial t} = \delta(\phi)\left[ \nu\,\mathrm{div}\!\left(\frac{\nabla\phi}{|\nabla\phi|}\right) - (f-u_+)^2 + (f-u_-)^2 - 2\mu\left( H(\phi) - H(T_p(\tilde\phi)) \right) \right] \qquad (13)$$

Fig. 2. (a) A generalized cone is sliced by three planes, at Z = 0.3, Z = 0 and
Z = -0.3. (b) The resulting intersections. (c) A generalized cone is intersected by
an inclined plane: ax + by + cz + d = 0. (d) The resulting contour.



2.6 Recovery of the Transformation Parameters 

In order to solve (13), one has to evaluate $\phi$ simultaneously with the recovery of
the transformation $T_p$ of the unlevel-set function $\tilde\phi$. The transformation
parameters are evaluated via the gradient descent equations obtained by minimizing
the energy functional (12) with respect to each parameter. We demonstrate this
for the special cases of pure translation and rotation.



Translation Translation of an image plane along the principal axis results
in scaling: as the planar section of the generalized cone gets closer to the
vertex, the cross-section shape becomes smaller, see Figs. 2a-b. Thus, a scale
factor can be incorporated into the energy functional, in compatibility with the
scene geometry, simply by translation. Equivalently, one can move the generalized
cone along the principal axis, while the plane remains stationary at Z = 0. In the
case of pure scaling, $T_p(\tilde\phi)$ reduces to $\tilde\phi + t_z$. Substituting this expression
into the shape-term (12) of the energy functional, and minimizing with respect to
$t_z$, gives the following equation:

$$\frac{\partial t_z}{\partial t} = 2\mu \int_{\Omega} \delta(\tilde\phi + t_z)\left( H(\phi) - H(\tilde\phi + t_z) \right) dx\,dy \qquad (14)$$

To account for general translation $t = (t_x, t_y, t_z)^T$, we can substitute the
expression for $T_p(\tilde\phi)$ (11) in (12), with $R = I$, where $I$ is the identity matrix. The
shape term takes the form

$$E_{shape}(\phi) = \int_{\Omega} \left( H(\phi(x,y)) - H(\tilde\phi(x + t_x, y + t_y) + t_z) \right)^2 dx\,dy$$

and the translation parameters $t_x$ and $t_y$ can be recovered similarly to $t_z$.
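
The scale recovery of equation (14) can be illustrated on a toy case. The sketch below is a hedged illustration, not the authors' code: it assumes circular contours, a cone with $Z_{vertex} = 1$, an arctan-based smooth Heaviside of assumed width `eps`, and a hand-picked step size `dt`. The evolving function is held fixed so that only $t_z$ is updated.

```python
import numpy as np

def heaviside(z, eps=0.1):
    """Smooth Heaviside (arctan form); eps is an assumed smoothing width."""
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(z / eps))

def delta(z, eps=0.1):
    """Derivative of heaviside(): a smooth Dirac delta."""
    return (eps / np.pi) / (eps**2 + z**2)

def tz_step(phi, cone, tz, mu=25.0, dt=2e-6, eps=0.1):
    """One explicit step of equation (14); the integral becomes a pixel sum."""
    shifted = cone + tz
    integrand = delta(shifted, eps) * (heaviside(phi, eps) - heaviside(shifted, eps))
    return tz + dt * 2.0 * mu * integrand.sum()

yy, xx = np.mgrid[0:41, 0:41].astype(float)
dist = np.hypot(xx - 20.0, yy - 20.0)
cone = 1.0 - dist / 10.0    # prior: circle of radius 10, vertex height 1
phi = 1.0 - dist / 8.0      # evolving function: zero-crossing at radius 8

tz = 0.0
for _ in range(400):
    tz = tz_step(phi, cone, tz)
# The zero-crossing of cone + tz is a circle of radius 10 * (1 + tz), so tz
# should settle close to -0.2 (a scale factor of about 0.8).
```

Because the smooth Heaviside has long tails, the recovered value is only approximately the exact one; a narrower `eps` tightens the match at the cost of a smaller stable step.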



Rotation Consider a tilted planar cut of the generalized cone, as shown in
Figs. 2c, d. The resulting contour is perspectively deformed, as a function of
the inclination of the intersecting plane and its proximity to the vertex of the
cone. Equivalently, one may rotate the generalized cone around its vertex, and
zero-cross to get the same perspective transformation.

Any rotation can be decomposed into rotations about the three axes (Euler's
rotation theorem), and can be represented by a matrix $R = R_X(\alpha)R_Y(\beta)R_Z(\gamma)$
operating on a vector $(x,y,z)^T$:

$$\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & \sin\alpha \\ 0 & -\sin\alpha & \cos\alpha \end{pmatrix}
\begin{pmatrix} \cos\beta & 0 & -\sin\beta \\ 0 & 1 & 0 \\ \sin\beta & 0 & \cos\beta \end{pmatrix}
\begin{pmatrix} \cos\gamma & \sin\gamma & 0 \\ -\sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \end{pmatrix}$$


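Let $\eta$ be some rotation angle corresponding to any of the angles $\alpha$, $\beta$ or $\gamma$. The
general gradient descent equation for a rotation angle is of the form:

$$\frac{\partial\eta}{\partial t} = 2\mu \int_{\Omega} \delta(T_p(\tilde\phi)) \left[ H(\phi) - H(T_p(\tilde\phi)) \right] \frac{\partial T_p(\tilde\phi)}{\partial\eta}\, dx\,dy \qquad (15)$$

where the derivative $\partial T_p(\tilde\phi)/\partial\eta$ is evaluated by the chain rule from
$\partial z'/\partial\eta$ and the products $\frac{\partial z'}{\partial x'}\frac{\partial x'}{\partial\eta}$ and $\frac{\partial z'}{\partial y'}\frac{\partial y'}{\partial\eta}$.
Note that $z = \tilde\phi(x,y)$ and $z' = T_p(\tilde\phi)$. The partial derivatives for $\eta = \beta$, for
example, are

$$\begin{aligned}
\frac{\partial x'}{\partial\beta} &= -x \sin\beta \cos\gamma - y \sin\beta \sin\gamma - z \cos\beta \\
\frac{\partial y'}{\partial\beta} &= x \sin\alpha \cos\beta \cos\gamma + y \sin\alpha \cos\beta \sin\gamma - z \sin\alpha \sin\beta \\
\frac{\partial z'}{\partial\beta} &= x \cos\alpha \cos\beta \cos\gamma + y \cos\alpha \cos\beta \sin\gamma - z \cos\alpha \sin\beta
\end{aligned} \qquad (16)$$

and similarly for $\eta = \alpha$ and $\eta = \gamma$. The values of $\partial z'/\partial x'$ and $\partial z'/\partial y'$ are derived
numerically from the cone surface values.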
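
The derivatives with respect to $\beta$ follow directly from the matrix product $R_X(\alpha)R_Y(\beta)R_Z(\gamma)$, and a finite-difference comparison is a quick way to check them. The sketch below is such a check; the function names, the sample point and the angles are arbitrary choices made here for illustration.

```python
import numpy as np

def rotation(alpha, beta, gamma):
    """R = R_X(alpha) R_Y(beta) R_Z(gamma), with the sign convention of the text."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    rx = np.array([[1.0, 0.0, 0.0], [0.0, ca, sa], [0.0, -sa, ca]])
    ry = np.array([[cb, 0.0, -sb], [0.0, 1.0, 0.0], [sb, 0.0, cb]])
    rz = np.array([[cg, sg, 0.0], [-sg, cg, 0.0], [0.0, 0.0, 1.0]])
    return rx @ ry @ rz

def d_dbeta(point, alpha, beta, gamma):
    """Analytic derivatives of (x', y', z') with respect to beta."""
    x, y, z = point
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    return np.array([
        -x * sb * cg - y * sb * sg - z * cb,
        x * sa * cb * cg + y * sa * cb * sg - z * sa * sb,
        x * ca * cb * cg + y * ca * cb * sg - z * ca * sb,
    ])

point = np.array([0.7, -1.2, 0.4])
alpha, beta, gamma = 0.3, -0.2, 1.1
step = 1e-6
numeric = (rotation(alpha, beta + step, gamma) @ point
           - rotation(alpha, beta - step, gamma) @ point) / (2.0 * step)
analytic = d_dbeta(point, alpha, beta, gamma)
```

The same check applies to $\alpha$ and $\gamma$ by differencing the corresponding argument of `rotation`.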

2.7 The Unlevel-Set Algorithm 

We summarize the proposed algorithm, for concurrent image segmentation given 
a prior contour, and recovery of the projective transformation between the cur- 
rent and prior object instances. 

1. The inputs are two images $f$ and $f'$ of the same object, taken with the
   same camera, but under different viewing conditions. The boundary $C$ of
   the object in $f$ is known. The image $f'$ has to be segmented. The image
   plane of the first image $f$ is assumed to be perpendicular to the principal
   axis, at distance 1 from the camera center. The second image plane, of $f'$, is
   tilted and shifted relative to the first one.

2. Given the contour $C$, construct a generalized cone $\tilde\phi$, using the expression in
   (10) with $Z_{vertex} = 1$.

3. Choose some initial level-set function $\phi$, for example a standard right cone.

4. Set initial values (e.g. zero) for $\alpha$, $\beta$, $\gamma$, $t_x$, $t_y$ and $t_z$.

5. Compute the average gray level values of the object and background pixels,
   $u_+$ and $u_-$, using equation (6).

Fig. 3. Synthetic example. (a) Prior image. The contour is known (not shown). (b)
Successful segmentation: the final contour is shown (black) on the transformed and
corrupted image. (c) The final contour C obtained in (b). (d) Generalized cone based
on the prior contour C. (e) Final level set function $\phi$. (f) Wrong segmentation: prior
knowledge was not used. (g) The final contour obtained in (f). (h) Wrong segmentation:
the prior is used without incorporating the projective transformation.



6. Compute the values of $T_p(\tilde\phi)$ according to equation (11), for the currently
   estimated transformation parameters.

7. Update $\phi$ according to the gradient descent equation (13).

8. Update $t$, using (14) for $t_z$ and similar equations for $t_x$ and $t_y$, and update
   $\alpha$, $\beta$ and $\gamma$ using (15), until convergence.

9. Repeat steps 5-8 until convergence.

3 Experimental Results 

To demonstrate our model, we present segmentation results on various synthetic
and real images. Relative scale and pose parameters between the image of the
known contour and the image to be segmented have been estimated and compared
to the ground-truth, where available. A strength of this algorithm is its low
sensitivity to the parameters of the functional. We use $\nu = 50$, $\mu = 25$
unless otherwise stated. Excluding the shape prior knowledge from the functional
means setting $\mu$ to zero.

Consider the synthetic images shown in Figs. 3a,b. Only the contour of the
object in Fig. 3a (not drawn) was known in advance and used as prior. The
object in Fig. 3b was generated from Fig. 3a by rotation and translation with
the following parameters: $R_X(\alpha) = 0.3°$, $R_Y(\beta) = -0.3°$ and $R_Z(\gamma) = 60°$, with a
scale factor of 0.9. It has also been broken and lightened. Note the significant
perspective distortion despite the fairly small rotations around the X and Y
axes. The black contour in Fig. 3b is the result of the segmentation process. For
clarity, the final contour is presented by itself in Fig. 3c. The generalized cone

Fig. 4. Real image with synthetic transformation. (a) Prior image. The contour is
known (not shown). (b) Successful segmentation: the final contour (black) on the
transformed image. (c) The final contour C obtained in (b). (d) Generalized cone $\tilde\phi$,
based on the prior contour C. (e) The final level set function $\phi$. (f) Wrong
segmentation: prior knowledge was not used. (g) The final contours obtained in (f).
(h) Wrong segmentation: the prior is used without incorporating the projective
transformation.



$\tilde\phi$ that was constructed, based on the known image contour, using Eq. (10), is
shown in Fig. 3d. Fig. 3e shows the final evolving level-set function $\phi$. It is worth
emphasizing that $\phi$ and $T_p(\tilde\phi)$ resemble each other in terms of their Heaviside
functions, that is, by their zero-crossings (the final contour), but not in their
entire shapes. The estimated transformation parameters are: $R_X(\alpha) = 0.38°$,
$R_Y(\beta) = -0.4°$, $R_Z(\gamma) = 56.6°$ and $t_z = -0.107$, which corresponds to a scaling of
0.893. When no shape prior is used, each part of the broken heart is segmented
separately (Figs. 3f-g). Segmentation fails when the prior is enforced without
recovery of the transformation parameters, as shown in Fig. 3h.
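
The link between the recovered $t_z$ and the quoted scale factor can be checked from (10). The short derivation below assumes $Z_{plane} = 0$ and $Z_{vertex} = 1$ (image plane at unit distance from the camera center); the symbol $\kappa$ for the scale factor is introduced here for illustration only.

```latex
% Zero-crossing of the translated cone:
\tilde\phi + t_z = 0
  \;\Longleftrightarrow\; r\,Z_{vertex} = -t_z
  \;\Longleftrightarrow\; r = -t_z .
% Points of that section, by (10), are (1-r)\,p(s) + r\,p_v,
% i.e. the directrix scaled about p_v by
\kappa = 1 - r = 1 + t_z .
% Recovered value:      t_z = -0.107 \;\Rightarrow\; \kappa = 0.893
% Ground-truth scaling: \kappa = 0.9 \;\Rightarrow\; t_z = -0.1
```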

We next consider the real images in Figs. 4a-b, where the black contour around
the object in Fig. 4b is again the segmentation result. The final contour itself is
shown in Fig. 4c. The transformation between the images was synthetic, so that
the calculated parameters could be compared with the ground-truth. The
transformation parameters are: $R_X(\alpha) = -0.075°$, $R_Y(\beta) = 0.075°$ and $R_Z(\gamma) = 9°$,
with a scaling factor of 0.8. Compare with the recovered transformation
parameters: $R_X(\alpha) = -0.063°$, $R_Y(\beta) = 0.074°$, $R_Z(\gamma) = 7.9°$ and a scaling of 0.81. The
generalized cone $\tilde\phi$, based on the given jar contour, and the final level set
function $\phi$ are shown in Figs. 4d-e, respectively. The jar shown is black on a white
background. Thus, without using the prior, the bright specular reflection spots
spoil the segmentation, as shown in Figs. 4f-g. Again, when the prior is enforced

Fig. 5. Real image with synthetic noise. (a) Prior image. The contour is known (not
shown). (b) Successful segmentation: the final contour (black) on the transformed
image. (c) The final contour C obtained in (b). (d) Generalized cone $\tilde\phi$, based on the
prior contour C. (e) Final level set function $\phi$. (f) Wrong segmentation: prior
knowledge was not used. (g) The final contours obtained in (f). (h) Wrong
segmentation: the prior is used without incorporating the projective transformation.



but the transformation parameters are not recovered, segmentation fails as seen 
in Fig. 4h. 

To check simultaneous translations along the X, Y and Z axes we applied
our algorithm to the images shown in Figs. 5a-b. The noisy Fig. 5b is segmented
correctly (black contour) in spite of the significant translation with respect to
the prior. No preprocessing alignment has been performed. The functional
parameters in this case were $\mu = 13$ and $\nu = 40$. The recovered transformation
parameters are: $t_x = 19.54$, $t_y = -18.8$, $t_z = 0.08$.

Finally, we demonstrate the method using a real object (mannequin head), 
which has actually been rotated, moved and occluded, as seen in Figs. 6a-c. 
The algorithm is able to segment the head precisely, in spite of the covering 
hat which has color similar to that of the mannequin. The segmenting contour 
accurately traces the profile of the mannequin, despite the significant
transformation. Since the actual transformation was not measured, Fig. 6e shows
the zero-crossing of the transformed generalized cone together with the final
segmenting contour (Fig. 6d) in order to confirm the recovered transformation
parameters.

Translation and rotation of non-planar objects may reveal previously hidden 
points and hide others. Therefore, the visible contour in a new instance of the 
object might be significantly different from the reference. However, as seen in the 
jar and mannequin examples, for moderate transformations of these non-planar 
objects, promising segmentation results are obtained. 

Fig. 6. Real example. (a) Reference image (mannequin head). The contour is known
(not shown). (b) New instance of the mannequin head, rotated and translated. (c)
Successful segmentation: the final contour (black) on the transformed mannequin
head. The segmentation is precise despite the covering hat. (d) The final contour C
obtained in (c). (e) The final contour as in (d), drawn on the Heaviside function of
the transformed generalized cone: $H(T_p(\tilde\phi))$. This shows the compatibility between
the calculated and actual transformation parameters. (f) Final shape of the evolving
level set function $\phi$. (g) Final contour obtained without using a shape prior. (h)
Final contour obtained using the prior but without recovery of the transformation
parameters.



4 Discussion 

Detection of an object in a corrupted image, based on a reference image taken
from a different view-point, is a classical challenge in computer vision.
This paper presents a novel approach that makes substantial progress towards 
this goal. The key to this accomplishment is the unique integration of scene 
geometry with the variational approach to segmentation. The reference shape is 
the foundation of a generalized cone. In principle, the zero level set of an evolving 
function, related to the image features, is matched with unlevel sections of the 
generalized cone that correspond to projectively deformed views of the shape. 

The suggested algorithm successfully accounts for scale and pose variations 
under the perspective model, including rotation outside the image plane, without 
using point correspondence. The algorithm converges empirically even for fairly 
large transformations and significantly corrupted images. Promising segmenta- 
tion results and accurate numerical estimation of the transformation parameters, 
suggest this model as an efficient tool for segmentation and image alignment. 
Abstract. We present a novel algorithm for extracting shapes of contours
of (possibly partially occluded) objects from noisy or low-contrast
images. The approach taken is Bayesian: we adopt a region-based model 
that incorporates prior knowledge of specific shapes of interest. To quan- 
tify this prior knowledge, we address the problem of learning probability 
models for collections of observed shapes. Our method is based on the 
geometric representation and algorithmic analysis of planar shapes in- 
troduced and developed in [15]. In contrast with the commonly used 
approach to active contours using partial differential equation methods 
[12,20,1], we model the dynamics of contours on vector fields on shape 
manifolds. 



1 Introduction 

The recognition and classification of objects present in images is an important 
and difficult problem in image analysis. Applications of shape extraction for ob- 
ject recognition include video surveillance, biometrics, military target recogni- 
tion, and medical imaging. The problem is particularly challenging when objects 
of interest are partially obscured in low-contrast or noisy images. Imaged ob- 
jects can be analyzed in many ways: according to their colors, textures, shapes, 
and other characteristics. The past decade has seen many advances in the
investigation of models of pixel values; however, these methods have found only
limited success in the recognition of imaged objects. Variational and level-set
methods have been successfully applied to a variety of segmentation, denoising, 
and inpainting problems (see e.g. [1]), but significant advances are still needed 
to satisfactorily address recognition and classification problems, especially in 
applications that require real-time processing. 

An emerging viewpoint among vision researchers is that global features such 
as shapes should be taken into account. The idea is that by incorporating some 
prior knowledge of shapes of objects of interest to image models, one should be 
able to devise more robust and efficient image analysis algorithms. Combined 
with clustering techniques for the hierarchical organization of large databases 
of shapes [22] , this should lead to recognition and classification algorithms with 
enhanced speed and performance. In this paper, we construct probability models 
on shape spaces to model a given collection of observed shapes, and integrate 
these to a region-based image model for Bayesian extractions of shapes from 
images. Our primary goal is to capture just enough information about shapes 
present in images to be able to identify them as belonging to certain categories
of objects known a priori, not to extract fine details of the contours of imaged 
objects. 

Shape analysis has been an important theme of investigation for many years. 
Following the seminal work of Kendall [13], a large part of the research in
quantitative shape analysis has been devoted to “landmark-based” studies, where
shapes are represented by finite samplings of contours. One establishes equiva- 
lences of representations with respect to shape preserving transformations, and 
then compares shapes in the resulting quotient space [5,21]. Statistical shape 
models based on this representation have been developed and applied to image 
segmentation and shape learning in [7,6]; the literature on applications of this 
methodology to a variety of problems is quite extensive. A drawback of this ap- 
proach is that the automatic selection of landmarks is not straightforward and 
the ensuing shape analysis is heavily dependent on the choices made. Grenander’s
deformable templates [8] avoid landmarks by treating shapes as points in
an infinite-dimensional differentiable manifold, and modeling variations of
planar shapes on an action of the diffeomorphism group of $\mathbb{R}^2$ [24,9,18]. However,
computational costs associated with this approach are typically very high. A 
very active line of research in image analysis is based on active contours [12,20] 
governed by partial differential equations; we refer the reader to [1] for a recent 
survey on applications of level-set methods to image analysis. Efforts in the di- 
rection of studying shape statistics using partial differential equation methods 
have been undertaken in [17,3,2]. 

In [15], Klassen et al. introduced a new framework for the representation and 
algorithmic analysis of continuous planar shapes, without resorting to defining 
landmarks or diffeomorphisms of $\mathbb{R}^2$. To quantify shape dissimilarities and
simulate optimal deformations of shapes, an algorithm was developed for computing
geodesic paths in shape spaces. The registration of curves to be compared is au- 
tomatic, and the treatment suggests a new technique for driving active contours 
[23]. In this paper, we investigate variants of this model for shape extraction from 
images. In our formulation, the dynamics of active contours is governed by vector 
fields on shape manifolds, which can be integrated with classical techniques and 
reduced computational costs. The basic idea is to create a manifold of shapes, 
define an appropriate Riemannian structure on it, and exploit its geometry to 
solve optimization and inference problems. 

An important element in this stochastic geometry approach to shape extrac- 
tion is a model for shape learning. Assuming that a given collection of observed 
shapes consists of random samples from a common probability model, we wish to 
learn the model. Examples illustrating the use of landmark-based shape analysis 
in problems of this nature are presented in [7,6,14,10]. The problem of model 
construction using the shape analysis methods of [15] presents two main
difficulties: the shape manifold is nonlinear and infinite-dimensional. A most basic
notion needed in the study of sample statistics is that of mean shape; Karcher 
means introduced in [11] are used. As in [5], other issues involving nonlinearity 
are handled by considering probability densities on the (linear) tangent space 
at the mean shape. To tackle the infinite dimensionality, we use approximate 
finite-dimensional representations of tangent vectors to shape manifolds. We 
consider multivariate normal models, so that learning reduces to estimation of 
the relevant parameters. (Other parametric models can be treated with similar 
techniques.) Implicit in this approach is that large collections of shapes have 
been pre-clustered and we are modeling clusters of fairly small diameters. Clus- 
tering algorithms and hierarchical organizations of large databases of shapes are 
discussed in [22]. 

This paper is organized as follows: in Section 2, we briefly review the material 
of [15], as it provides the foundations of our stochastic geometry approach to 
shape extraction. Section 3 is devoted to a discussion of shape learning. In Section 
4, we present the image model used in the shape extraction algorithm, and 
applications of the algorithm to imagery involving partial occlusions of objects, 
low contrast, or noise. 



2 Shape Spaces and Geodesic Metrics 



In this section, we review the geometric representation of continuous planar 
shapes, the geodesic metric on shape space, and the algorithmic shape analysis 
methods introduced and developed in [15]. 



2.1 Geometric Representation of Shapes 

Shapes of outer contours of imaged objects are viewed as closed, planar curves 
$\alpha: I \to \mathbb{R}^2$, where $I = [0, 2\pi]$. To make shape representations invariant to 
uniform scaling, the length is fixed to be $2\pi$ by requiring that curves be 
parameterized by arc length, i.e., $\|\alpha'(s)\| = 1$, for every $s \in I$. Then, the tangent 
vector can be written as $\alpha'(s) = e^{j\theta(s)}$, where $j = \sqrt{-1}$. We refer to 
$\theta: I \to \mathbb{R}$ as an angle function for $\alpha$. Angle functions are invariant under 
translations of $\mathbb{R}^2$, and the effect of a rotation is to add a constant to $\theta$. 
Thus, to make the representation invariant to rotations of $\mathbb{R}^2$, it suffices to fix 
the average of $\theta$ to be, say, $\pi$. In addition, to ensure that $\theta$ represents a closed 
curve, the condition $\int_0^{2\pi} \alpha'(s)\,ds = \int_0^{2\pi} e^{j\theta(s)}\,ds = 0$ is imposed. Thus, 
angle functions are restricted to the pre-shape manifold 

$$\mathcal{C} = \left\{ \theta \in L^2 \;:\; \frac{1}{2\pi} \int_0^{2\pi} \theta(s)\,ds = \pi \ \text{ and } \ \int_0^{2\pi} e^{j\theta(s)}\,ds = 0 \right\}. \qquad (1)$$

Here, $L^2$ denotes the vector space of all square integrable functions on $[0, 2\pi]$, 
equipped with the standard inner product $\langle f, g \rangle = \int_0^{2\pi} f(s) g(s)\,ds$. For 
continuous direction functions, the only remaining variability in the representation is 
due to the action of the reparametrization group $S^1$ arising from different 
possible placements of the initial point $s = 0$ on the curve. Hence, the quotient space 
$\mathcal{S} = \mathcal{C}/S^1$ is defined as the space of continuous, planar shapes. 
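
The two constraints defining the pre-shape manifold can be checked numerically. The following is a minimal sketch, assuming a midpoint-rule discretization of $[0, 2\pi]$; the helper names `sample_angle_function`, `mean_angle` and `closure_residual` are illustrative and do not come from [15]. It verifies that the angle function $\theta(s) = s$ (a circle) has average $\pi$ and satisfies the closure condition, whereas a constant angle function (a straight segment) does not close.

```python
import cmath
import math


def sample_angle_function(theta, n):
    """Sample theta at the midpoints of n equal subintervals of [0, 2*pi]."""
    h = 2 * math.pi / n
    return [theta((k + 0.5) * h) for k in range(n)], h


def mean_angle(samples, h):
    """Midpoint-rule approximation of (1 / 2pi) * integral of theta(s) ds."""
    return sum(samples) * h / (2 * math.pi)


def closure_residual(samples, h):
    """Midpoint-rule approximation of |integral of exp(j * theta(s)) ds|."""
    return abs(sum(cmath.exp(1j * t) for t in samples) * h)


# theta(s) = s is the angle function of a circle of length 2*pi.
circle, h = sample_angle_function(lambda s: s, 200)
circle_mean = mean_angle(circle, h)
circle_gap = closure_residual(circle, h)

# A constant angle function describes a straight segment, which is not closed.
line, _ = sample_angle_function(lambda s: math.pi, 200)
line_gap = closure_residual(line, h)
```

In this sketch the residual of the straight segment equals its length, $2\pi$, which is the gap between its two end points.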
2.2 Geodesic Paths between Shapes 

At each point $\theta \in \mathcal{C}$, the tangent space $T_\theta\mathcal{C}$ to the pre-shape manifold $\mathcal{C} \subset L^2$ 
naturally inherits an inner product from $L^2$. Thus, $\mathcal{C}$ is a Riemannian manifold 
and the distance between points in $\mathcal{C}$ can be defined using minimal length 
geodesics. The distance between two points (i.e., shapes) $\theta_1$ and $\theta_2$ in $\mathcal{S}$ is defined 
as the infimum of all pairwise distances between pre-shapes representing $\theta_1$ and 
$\theta_2$, respectively. Thus, the distance $d(\theta_1, \theta_2)$ in $\mathcal{S}$ is realized by a shortest geodesic 
in $\mathcal{C}$ between pre-shapes associated with $\theta_1$ and $\theta_2$. We abuse terminology and 
use the same symbol $\theta$ to denote both a pre-shape and its associated shape in 
$\mathcal{S}$. We also refer to minimal geodesics in $\mathcal{C}$ realizing distances in $\mathcal{S}$ as geodesics 
in $\mathcal{S}$, and to tangent vectors to these geodesics as tangent vectors to $\mathcal{S}$. 

One of the main results of [15] is the derivation of an algorithm to compute 
geodesics in $\mathcal{C}$ (and $\mathcal{S}$) connecting two given points. An easier problem is the 
calculation of geodesics satisfying prescribed initial conditions. Given $\theta \in \mathcal{C}$ 
and $f \in T_\theta\mathcal{C}$, let $\Psi(\theta, f, t)$ denote the geodesic starting at $\theta$ with velocity $f$, 
where $t$ denotes the time parameter. The geodesic $\Psi(\theta, f, t)$ is constructed with 
a numerical integration of the differential equation satisfied by geodesics. The 
correspondence $f \mapsto \Psi(\theta, f, 1)$ defines a map $\exp_\theta : T_\theta\mathcal{C} \to \mathcal{C}$ known as the 
exponential map at $\theta$. The exponential map simply evaluates the position of 
the geodesic $\Psi$ at time $t = 1$. Consider the exponential map at $\theta_1$. Finding 
the geodesic from $\theta_1$ to $\theta_2$ is equivalent to finding the direction $f$ such that 
$\exp_{\theta_1}(f) = \theta_2$. For each $f \in T_{\theta_1}\mathcal{C}$, let $E(f) = \|\exp_{\theta_1}(f) - \theta_2\|^2$ be the square 
of the $L^2$ norm of the residual vector. The goal is to find the vector $f$ that 
minimizes (i.e., annihilates) $E$. A gradient search is used in [15] to solve this 
energy minimization problem. This procedure can be refined to yield geodesics 
in $\mathcal{S}$ by incorporating the action of the re-parametrization group $S^1$ into the 
search. 
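
The shooting strategy described above can be illustrated on a much simpler manifold. The sketch below is a toy stand-in, not the algorithm of [15]: it uses the unit sphere in $\mathbb{R}^3$, where the exponential map has a closed form (on the pre-shape manifold it is obtained by numerical integration), and it assumes a finite-difference gradient, a fixed step size and the hypothetical helper names `exp_map`, `miss_energy` and `shoot`. The search looks for a tangent vector $f$ at $p$ with $\exp_p(f) = q$ by descending the miss energy $E(f) = \|\exp_p(f) - q\|^2$.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def exp_map(p, f):
    """Exponential map on the unit sphere: great circle from p with velocity f, at t = 1."""
    a = norm(f)
    if a < 1e-12:
        return list(p)
    return [math.cos(a) * pc + math.sin(a) * fc / a for pc, fc in zip(p, f)]


def project_to_tangent(p, v):
    """Remove the component of v along p, so that the result is tangent at p."""
    d = sum(pc * vc for pc, vc in zip(p, v))
    return [vc - d * pc for pc, vc in zip(p, v)]


def miss_energy(p, f, q):
    """E(f) = || exp_p(f) - q ||^2."""
    return sum((a - b) ** 2 for a, b in zip(exp_map(p, f), q))


def shoot(p, q, steps=500, rate=0.2, h=1e-6):
    """Gradient search for a tangent vector f at p such that exp_p(f) = q."""
    f = project_to_tangent(p, [b - a for a, b in zip(p, q)])  # initial guess
    for _ in range(steps):
        grad = []
        for i in range(3):  # central finite differences in ambient coordinates
            fp, fm = list(f), list(f)
            fp[i] += h
            fm[i] -= h
            grad.append((miss_energy(p, fp, q) - miss_energy(p, fm, q)) / (2 * h))
        grad = project_to_tangent(p, grad)  # keep the update inside the tangent space
        f = [fc - rate * gc for fc, gc in zip(f, grad)]
    return f


p = [1.0, 0.0, 0.0]
q = [0.0, 1.0, 0.0]
f = shoot(p, q)
geodesic_length = norm(f)  # a quarter of a great circle for these two points
```

Once the shooting direction is found, its norm gives the geodesic distance, which is how the distance $d$ is read off in this toy setting.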

Figure 1 shows an example of a geodesic path in $\mathcal{S}$ computed with this 
algorithm. In this paper, we have added the invariance of shapes to reflections in 
$\mathbb{R}^2$. Essentially, one computes geodesics for both a shape and its reflection, and 
selects the one with least length. 

Fig. 1. A geodesic path in shape space 

2.3 Karcher Mean Shapes 

The use of Karcher means to define mean shapes in $\mathcal{S}$ is suggested in [15]. If 
$\theta_1, \ldots, \theta_n \in \mathcal{S}$ and $d(\theta, \theta_j)$ is the geodesic distance between $\theta$ and $\theta_j$, a Karcher 
mean is defined as an element $\mu \in \mathcal{S}$ that minimizes the quantity $\sum_{j=1}^{n} d^2(\theta, \theta_j)$. 

An iterative algorithm for computing Karcher means in Riemannian manifolds is 
presented in [16,11] and particularized to the spaces $\mathcal{C}$ and $\mathcal{S}$ in [15]. An example 
of a Karcher mean shape is shown in Figure 2. 
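
To make the iterative idea concrete, the following toy sketch computes a Karcher mean on the unit circle rather than on shape space. It is an assumption-laden illustration, not the procedure of [16,11,15]: the helper names `log_map`, `karcher_mean` and `dispersion`, the step size and the iteration count are all hypothetical. Each iteration lifts the samples to the tangent space at the current estimate, averages them, and moves the estimate along that average.

```python
import math


def log_map(mu, theta):
    """Signed shortest angular displacement from mu to theta, in (-pi, pi]."""
    d = (theta - mu) % (2 * math.pi)
    return d - 2 * math.pi if d > math.pi else d


def karcher_mean(thetas, step=0.5, iters=100):
    """Iteratively move mu along the average of the lifted samples."""
    mu = thetas[0]
    for _ in range(iters):
        v = sum(log_map(mu, t) for t in thetas) / len(thetas)
        mu = (mu + step * v) % (2 * math.pi)
    return mu


def dispersion(mu, thetas):
    """Sum of squared geodesic distances from mu to the samples."""
    return sum(log_map(mu, t) ** 2 for t in thetas)


samples = [6.2, 0.1, 0.3]          # angles clustered around 0 (mod 2*pi)
mu = karcher_mean(samples)
naive = sum(samples) / len(samples)  # arithmetic mean ignores the wrap-around
```

The arithmetic mean of the raw angles lands far from the cluster, while the Karcher mean stays inside it and has a smaller dispersion.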




Fig. 2. The Karcher mean shape of eight boots. 



2.4 Computational Speeds 

To demonstrate the level of performance of the algorithm to compute geodesics 
between two shapes, Table 1 shows the average computation times achieved 
under several different settings, each estimated by averaging 1,225 calculations 
on a personal computer with dual Xeon CPUs (at 2.20 GHz) running Linux. 
Each shape is sampled using T points on the curve and tangent vectors are 
approximated using 2m Fourier terms. Consistent with our analysis, the 
algorithm for calculating geodesics is linear in T and m. Computational efficiency 
can be further improved with parallel processing, since the costliest step in the 
algorithm consists of 2m calculations that can be executed independently. 

Table 1. Average computation time (in seconds) per geodesic. 

T               50      50      100     100     200     200     400     400 
m               50      100     100     200     200     400     400     800 
Time (secs.)    0.0068  0.0133  0.0268  0.0525  0.1044  0.2066  0.4172  0.8274 
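
The linear scaling can be checked directly against the timings reported in Table 1. In the short sketch below (the variable names are illustrative), the time per geodesic is divided by the product T·m; if the cost is linear in T and in m, this ratio should stay roughly constant across all settings.

```python
# (T, m, seconds per geodesic), as reported in Table 1.
timings = [
    (50, 50, 0.0068), (50, 100, 0.0133), (100, 100, 0.0268), (100, 200, 0.0525),
    (200, 200, 0.1044), (200, 400, 0.2066), (400, 400, 0.4172), (400, 800, 0.8274),
]

# Cost per unit of T * m for each setting.
per_unit = [seconds / (T * m) for T, m, seconds in timings]

# Ratio between the largest and the smallest per-unit cost.
spread = max(per_unit) / min(per_unit)
```

For these eight settings the per-unit cost varies by only a few percent, in line with the stated linear behavior.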

3 Shape Learning 

An important problem in statistical shape analysis is to “learn” probability 
models for a collection of observed shapes. Assuming that the given shapes 
are random samples from the same probability model, we wish to learn the 
model. These models can then be used as shape priors in Bayesian inferences 
to recognize or classify newly observed shapes. Implicit in our considerations is 
the assumption that observed shapes have been pre-clustered, so that we are 
seeking probability models for clusters of fairly small diameters in $\mathcal{S}$. Clustering 
techniques on the shape space $\mathcal{S}$ have been studied in [22]. 
Learning a probability model amounts to estimating a probability density 
function on shape space, a task that is rather difficult to perform precisely. In this 
paper, we assume a parametric form for the models so that learning is reduced 
to an estimation of the relevant parameters. To simplify the discussion of proba- 
bility models on infinite-dimensional manifolds, the models will be presented in 
terms of their negative log-likelihood, i.e., the energy of the distribution. 

The simplest model is a “uniform Gaussian” on $\mathcal{S}$, whose energy is proportional 
to $d^2(\theta, \mu)/2$, where $\mu$ is the Karcher mean of the sample. The constant of 
proportionality is related to the variance, as usual. We wish to refine the model 
to a multivariate normal distribution. Two main difficulties encountered are the 
nonlinearity and the infinite-dimensionality of $\mathcal{S}$, which are addressed as follows. 



(i) Nonlinearity. Since $\mathcal{S}$ is a nonlinear space, we consider probability 
distributions on the tangent space $T_\mu\mathcal{C}$ at the mean pre-shape $\mu \in \mathcal{C}$, to avoid 
dealing with the nonlinearity of $\mathcal{S}$ directly. This is similar to the approach 
taken in [5]. 

(ii) Dimensionality. Our parametric models will require estimations of covariance 
operators of probability distributions on $T_\mu\mathcal{C} \subset L^2$. We approximate 
covariances by operators defined on finite-dimensional subspaces of $T_\mu\mathcal{C}$. 

Let $\Theta = \{\theta_1, \ldots, \theta_r\}$ represent a finite collection of shapes. The estimation 
of the Karcher mean shape $\mu$ of $\Theta$ is described in [15]. Using $\mu$ and the shapes 
$\theta_j$, $1 \le j \le r$, we find tangent vectors $v_j \in T_\mu\mathcal{S}$ such that the geodesic from $\mu$ 
in the direction $v_j$ reaches $\theta_j$ in unit time, that is, $\exp_\mu(v_j) = \theta_j$. This lifts the 
shape representatives to the tangent space at $\mu$. 

Let $V$ be the subspace of $T_\mu\mathcal{S}$ spanned by $\{v_1, \ldots, v_r\}$, and $\{e_1, \ldots, e_m\}$ 
an orthonormal basis of $V$. Given $v \in V$, write it as $v = x_1 e_1 + \ldots + x_m e_m$. 
The correspondence $v \mapsto x = (x_1, \ldots, x_m)$ identifies $V$ with $\mathbb{R}^m$, so we assume 
that $v_j \in \mathbb{R}^m$. We still have to decide what model to adopt for the probability 
distribution. We assume a multivariate Gaussian model for $x$ with mean 0 and 
covariance matrix $K \in \mathbb{R}^{m \times m}$. The estimation of $K$ using sample covariance 
follows the usual procedures. Depending on the number and the nature of the 
shape observations, the rank of $K$ may be much smaller than $m$. Extracting the 
dominant eigenvectors and eigenvalues of the estimated covariance matrix, one 
captures the dominant modes of variation and the variances along these principal 
directions. 
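
The estimation step can be sketched as follows, assuming the tangent vectors have already been expressed as coordinate vectors in $\mathbb{R}^m$. The choices made here are assumptions for illustration only: the covariance is taken about the model mean 0 with normalization 1/r, the dominant mode is found by plain power iteration, and the names `sample_covariance` and `dominant_mode` are hypothetical.

```python
import math


def sample_covariance(xs):
    """K = (1/r) * sum_j x_j x_j^T for zero-mean tangent coordinates x_j."""
    r, m = len(xs), len(xs[0])
    return [[sum(x[a] * x[b] for x in xs) / r for b in range(m)] for a in range(m)]


def dominant_mode(K, iters=200):
    """Power iteration: largest eigenvalue of K and a unit eigenvector."""
    m = len(K)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(K[a][b] * v[b] for b in range(m)) for a in range(m)]
        n = math.sqrt(sum(c * c for c in w))
        v = [c / n for c in w]
    lam = sum(v[a] * sum(K[a][b] * v[b] for b in range(m)) for a in range(m))
    return lam, v


# Four toy observations in R^3 that only vary in the first two coordinates.
xs = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
K = sample_covariance(xs)
lam, mode = dominant_mode(K)
```

In this toy example the estimated covariance has rank 2 although m = 3, which is the rank-deficient situation mentioned above.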

To allow small shape variations in directions orthogonal to those determined 
by the non-zero eigenvalues of $K$, choose $\varepsilon > 0$ somewhat smaller than the 
dominant eigenvalues of $K$. If $K_\varepsilon = K + \varepsilon I_m$, where $I_m$ is the $m \times m$ identity 
matrix, we adopt the multivariate normal distribution 

$$\frac{1}{\sqrt{(2\pi)^m \det K_\varepsilon}} \exp\left( -\frac{1}{2}\, x^T K_\varepsilon^{-1} x \right) \qquad (2)$$

on the subspace $V$ of $T_\mu\mathcal{S}$. If $\theta \in \mathcal{S}$, let $g \in T_\mu\mathcal{S}$ satisfy $\Psi(\mu, g, 1) = \theta$, and let 
$g_V = \sum_{i=1}^{m} x_i e_i$ be the orthogonal projection of $g$ onto $V$. We adopt a probability 
model on $\mathcal{S}$ whose energy is given up to an additive constant by 

$$F(\theta; \mu, K) = \frac{1}{2}\, x^T K_\varepsilon^{-1} x + \frac{1}{2\varepsilon} \|g^\perp\|^2, \qquad (3)$$

where $g^\perp \in T_\mu\mathcal{S}$ is the component of $g$ orthogonal to $V$. Strictly speaking, 
this definition is only well posed if the exponential map is globally one-to-one. 
However, for most practical purposes, one can assume that this condition is 
essentially satisfied because clusters are assumed to be concentrated near the 
mean and finite-dimensional approximations to $\theta$ are used. 

The first row of Figure 3 shows eigenshapes associated with the first five 
eigenvalues (in decreasing order) of the multivariate normal model derived from 
the shapes in Figure 2. The solid lines show the mean shape, and the dotted 
lines represent variations about the mean along principal directions. Variations 
are uniformly sampled on an interval of size proportional to the eigenvalues. 

Having obtained a probability model for observed shapes, an important task 
is to validate it. This can be done in a number of ways. As an illustration, we 
use the model for random sampling. The second and third rows of Figure 3 show 
examples of random shapes generated using the Gaussian model learned from 
the shapes in Figure 2. 
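
The sampling step used for this validation can be sketched in tangent-space coordinates. The example assumes that the covariance has already been diagonalized, so that coordinates along the principal directions are independent with variances equal to the eigenvalues plus the regularization; the name `sample_tangent_coords`, the eigenvalues and the value of the regularization are hypothetical. Mapping the sampled coordinates back to a shape requires the exponential map at the mean shape, which is omitted here.

```python
import math
import random


def sample_tangent_coords(eigenvalues, eps, rng):
    """Draw coordinates along the principal directions from N(0, K + eps * I)."""
    return [math.sqrt(lam + eps) * rng.gauss(0.0, 1.0) for lam in eigenvalues]


rng = random.Random(0)            # fixed seed so that the run is reproducible
eigenvalues = [2.0, 0.5, 0.0]     # illustrative spectrum with one null direction
eps = 0.1                         # illustrative regularization

draws = [sample_tangent_coords(eigenvalues, eps, rng) for _ in range(20000)]

# Empirical variances along each principal direction.
variances = [sum(d[i] ** 2 for d in draws) / len(draws) for i in range(3)]
```

The empirical variances should approach 2.1, 0.6 and 0.1, so the regularization is what allows small variations along the direction in which the sample covariance is degenerate.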




Fig. 3. The first row shows eigenboots in dotted lines, i.e., variations about the mean 
shape of Figure 2 (displayed in solid lines) along principal directions associated with 
the five dominant eigenvalues. The second and third rows display 18 random shapes 
sampled from the proposed multivariate normal model. 



Another example is shown in Figure 4. A set of nineteen observed shapes 
of swimming ducks is analyzed for learning a probability model. We calculated 
the mean shape - shown on the lower right corner - and estimated the sample 
covariance matrix K. Figure 5 shows variations of the mean shape along the 
dominant principal directions and ten random shapes generated using the learned 
probability model. 
Fig. 4. Nineteen shapes of swimming ducks and their mean shape displayed in the 
lower right corner. 




Fig. 5. Eigenducks (first row) and ten random shapes generated by sampling from a 
multivariate normal tangent-space model. 



4 Bayesian Extraction of Shapes 

The extraction of shapes of partially occluded objects from noisy or low-contrast 
images is a difficult problem. Lack of clear data in such problems may severely 
limit the performance of image segmentation algorithms. Thus, techniques for 
integrating some additional knowledge about shapes of interest into the inference 
process are sought. The framework developed in this paper is well suited to the 
formulation and solution of Bayesian shape inference problems involving this 
type of imagery. We assume that the shape to be extracted is known a priori 
to be related to a family modeled on a probability distribution of the type 
discussed in Section 3. The case of several competing models can be treated with 
a combination of our shape extraction method and hypothesis testing techniques. 
We emphasize that our goal is to extract just enough features of shapes present in 
images to be able to recognize objects as belonging to certain known categories, 
not to capture minute details of shapes. Such a low-resolution approach is more 
robust to noise and allows for greater computational efficiency. 

Our analysis thus far has focused on shape, a property that is independent of 
variables that account for rotations, translations, and scalings of objects. How- 
ever, shapes appear in images at specific locations and scales, so the process 
of shape extraction and recognition should involve an estimation of these nui- 
sance variables as well. Hence, in this context, the data likelihood term assumes 
knowledge of shape, location, and scale variables, while the prior term depends 
only on shape. Therefore, we first revisit our shape representation to incorporate 
these extra variables. 

To account for translational effects, we introduce a variable $p \in \mathbb{R}^2$ that 
identifies the centroid of a constant-speed curve $\alpha: I \to \mathbb{R}^2$, which is given by 
$p = (1/2\pi) \int_0^{2\pi} \alpha(s)\,ds$. We adopt a logarithmic scale for the length by writing 
$L = e^\ell$, $\ell \in \mathbb{R}$. Lastly, to allow arbitrary rotations, we simply relax the constraints 
on $\theta$ used in the description of the pre-shape manifold $\mathcal{C}$ and only require that 
$\theta$ satisfy the closure conditions 

$$\int_0^{2\pi} \cos\theta(s)\,ds = 0 \quad \text{and} \quad \int_0^{2\pi} \sin\theta(s)\,ds = 0. \qquad (4)$$

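Thus, pre-shapes that can change position and are free to shrink or stretch 
uniformly will be described by triples $(p, \ell, \theta) \in \mathbb{R}^2 \times \mathbb{R} \times L^2$ satisfying (4). The 
collection of all such triples will be denoted $\mathcal{F}$. An element $(p, \ell, \theta) \in \mathcal{F}$ represents 
the curve 

$$\alpha(s) = p + e^\ell \int_0^s e^{j\theta(x)}\,dx - \frac{e^\ell}{2\pi} \int_0^{2\pi}\!\!\int_0^s e^{j\theta(x)}\,dx\,ds. \qquad (5)$$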
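
The reconstruction of a curve from a triple can be sketched numerically. The sketch below assumes that the curve is traversed with constant speed $e^\ell$ (a different normalization only changes a constant factor), uses a Riemann sum for the inner integral, and represents points of the plane as complex numbers; the function name `curve_from_triple` is hypothetical. Subtracting the average of the integrated curve is the discrete counterpart of the double integral, and it places the centroid at $p$.

```python
import cmath
import math


def curve_from_triple(p, ell, theta_samples):
    """Return the points alpha(s_k), k = 0..n-1, and the end-point gap of the curve."""
    n = len(theta_samples)
    h = 2 * math.pi / n
    speed = math.exp(ell)
    pts, acc = [], 0j
    for t in theta_samples:
        pts.append(acc)
        acc += speed * cmath.exp(1j * t) * h  # Riemann sum of the tangent vector
    centroid = sum(pts) / n
    # Shift the curve so that its centroid sits at p; acc is the closure gap.
    return [p + (z - centroid) for z in pts], acc


n = 360
h = 2 * math.pi / n
p = 3 + 4j
ell = math.log(2.0)
thetas = [(k + 0.5) * h for k in range(n)]  # angle function of a circle

points, gap = curve_from_triple(p, ell, thetas)
radius = math.exp(ell) * h / (2 * math.sin(h / 2))  # circumradius of the sampled polygon
```

For this angle function the sampled curve is a regular polygon centered at $p$ whose circumradius is close to $e^\ell$, and the gap vanishes because the closure conditions (4) hold.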

For shape extraction, we do not need to further consider the quotient space 
under the action of the re-parameterization group $S^1$ on $\mathcal{F}$. The data likelihood 
and shape prior terms will be invariant under the $S^1$-action, so the posterior 
energy will be constant along $S^1$-orbits. We now describe the posterior energy 
for our Bayesian inference. 

(a) Data Likelihood. Let $D \subset \mathbb{R}^2$ be the image domain and $I: D \to \mathbb{R}_+$ be an 
image. A closed curve represented by $(p, \ell, \theta)$ divides the image domain into 
a region $D_i(p, \ell, \theta)$ inside the curve, and a region $D_o(p, \ell, \theta)$ outside. Let $p_i$ be 
a probability model for the pixel values inside the curve, and $p_o$ be a model 
for pixels outside. For simplicity, we assume a binary image model choosing 
$p_i$ and $p_o$ to be Gaussian distributions with different means. (Alternatively, 
one can use variants of the Mumford-Shah image model [19].) For a given 
$(p, \ell, \theta)$, the compatibility of an image $I$ with $(p, \ell, \theta)$ is proportional to 

$$H(I \mid p, \ell, \theta) = - \iint_{D_i(p, \ell, \theta)} \log p_i(I(y))\,dy - \iint_{D_o(p, \ell, \theta)} \log p_o(I(y))\,dy. \qquad (6)$$

(b) Shape Prior. Let $\mu_0$ and $K_0$ represent the mean and the covariance matrix 
associated with the shape prior model. Set the prior energy to be $F([\theta]; \mu_0, K_0)$, 
as in Equation 3, where $[\theta]$ indicates that $\theta$ has been normalized to have 
average $\pi$. 
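
The data likelihood term in (a) can be sketched on a pixel grid. In the example below, the region inside the curve is given directly as a boolean mask instead of being derived from a curve, each pixel has unit area, and the two Gaussian models share a standard deviation; the function names, the image size and the disc-shaped test object are assumptions made for illustration.

```python
import math


def gaussian_logpdf(v, mean, sigma):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * ((v - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def data_energy(image, inside, mean_in, mean_out, sigma=1.0):
    """H = - sum over inside pixels of log p_i(I) - sum over outside pixels of log p_o(I)."""
    total = 0.0
    for row_vals, row_mask in zip(image, inside):
        for v, is_in in zip(row_vals, row_mask):
            mean = mean_in if is_in else mean_out
            total -= gaussian_logpdf(v, mean, sigma)
    return total


def disc_mask(n, cx, cy, r):
    """Boolean n x n mask that is True inside a disc."""
    return [[(x - cx) ** 2 + (y - cy) ** 2 <= r * r for x in range(n)] for y in range(n)]


n = 16
truth = disc_mask(n, 8, 8, 4)
image = [[1.0 if t else 0.0 for t in row] for row in truth]  # bright disc on dark background

h_aligned = data_energy(image, truth, 1.0, 0.0)
h_shifted = data_energy(image, disc_mask(n, 11, 8, 4), 1.0, 0.0)
```

The energy is lowest when the region enclosed by the candidate curve coincides with the object, and it grows as the candidate region is displaced.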

Combining the two terms, up to an additive constant, the posterior energy is 
proportional to 

$$P_\lambda(p, \ell, \theta \mid I) = \lambda H(I \mid p, \ell, \theta) + (1 - \lambda) F([\theta]; \mu_0, K_0),$$

where $0 < \lambda < 1$. As before, it is convenient to lift $\theta$ to the tangent space at the 
mean $\mu_0$. Letting $\theta = \Psi(\mu_0, g, 1) = \exp_{\mu_0}(g)$, with $g$ in the tangent space at $\mu_0$, we 
rewrite the posterior energy as 

$$E_\lambda(p, \ell, g \mid I) = P_\lambda(p, \ell, \exp_{\mu_0}(g) \mid I).$$

We use a gradient search for a MAP estimation of $(p, \ell, g)$, approximating $g$ with 
a truncated Fourier series. 

Fig. 6. An example of shape extraction. Top row from left to right: an image, the same 
image with a partial occlusion, and the prior mean. The bottom row shows MAP shape 
estimates with an increasing influence of the prior from left to right. 

Fig. 7. A shape extraction experiment. The top row shows an image of a skull to be 
analyzed after being artificially obscured and a second skull whose contour is used as 
prior mean. The bottom row shows various stages of the curve evolution during the 
shape extraction. 

Shown in Figure 6 are illustrations of this Bayesian shape extraction using 
a uniform Gaussian prior. The top row shows an object embedded in an image, 
the same image with the object partially obscured, and the prior mean shape. 
The bottom row displays MAP estimates of the shape of the object under an 
increasing influence of the prior. The improvements in discovering hidden shapes 
despite partial occlusions emphasize the need for, and the power of, a Bayesian approach 
to such problems using shape priors. 

Figure 7 depicts the results of another shape extraction experiment. On the 
top row, the first panel displays the image of a skull that is artificially obscured 
for shape extraction. The second panel shows the contour of a different skull 
used as prior mean. The second row shows various stages of the curve evolution 
during the gradient search of a MAP estimate of the shape. 

5 Summary 

We presented an algorithm for the extraction of shapes of partially obscured 
objects from noisy, low-contrast images for the recognition and classification 
of imaged objects. The image model adopted involves a data likelihood term 
based on pixel values and a shape prior term that makes the algorithm robust to 
image quality and partial occlusions. We discussed learning techniques in shape 
space in order to construct probability models for clusters of observed shapes 
using the framework for shape analysis developed in [15]. A novel technique that 
models the dynamics of active contours with vector fields on shape manifolds was 
employed. Various shape extraction experiments were carried out to demonstrate 
the performance of the algorithm. 
Abstract. We propose a variational framework for the integration of multiple 
competing shape priors into level set based segmentation schemes. 
By optimizing an appropriate cost functional with respect to both a level 
set function and a (vector-valued) labeling function, we jointly generate 
a segmentation (by the level set function) and a recognition-driven par- 
tition of the image domain (by the labeling function) which indicates 
where to enforce certain shape priors. Our framework fundamentally ex- 
tends previous work on shape priors in level set segmentation by directly 
addressing the central question of where to apply which prior. It allows 
for the seamless integration of numerous shape priors such that - while 
segmenting both multiple known and unknown objects - the level set 
process may selectively use specific shape knowledge for simultaneously 
enhancing segmentation and recognizing shape. 



1 Introduction 

Image segmentation and object recognition in vision are driven both by low- 
level cues such as intensities, color or texture properties, and by prior knowledge 
about objects in our environment. Modeling the interaction between such data- 
driven and model-based processes has become the focus of current research on 
image segmentation in the field of computer vision. In this work, we consider 
prior knowledge given by the shapes associated with a set of familiar objects and 
focus on the problem of how to exploit such knowledge for images containing 
multiple objects, some of which may be familiar, while others may be unfamiliar. 

Following their introduction as a means of front propagation [13], level set 
based contour representations have become a popular framework for image segmentation 
[1,10]. They permit elegant modeling of topological changes of the 
implicitly represented boundary, which makes them well suited for segmenting 
images containing multiple objects. Level set segmentation schemes can be 
formulated to exploit various low level cues such as edge information [10,2,8], 
intensity homogeneity [3,18], texture [14] or motion information [6]. In recent 
years, there has been much effort in trying to integrate prior shape knowledge 
into level set based segmentation. This was shown to make the segmentation 
process robust to misleading low-level information caused by noise, background 
clutter or partial occlusion of an object of interest (cf. [9,17,5,15]). 

A key problem in this context is to ensure that prior knowledge is selectively 
applied at image locations only where image data indicate a familiar object. 
Conversely, lack of any evidence for the presence of some familiar object should 
result in a purely data-driven segmentation process. To this end, it was recently 
proposed to introduce a labeling function in order to restrict the effect of a given 
prior to a specific domain of the image plane [7] (for a use of a labeling field in a 
different context see [11]). During optimization, this labeling function evolves so 
as to select image regions where the given prior is applied. The resulting process 
segments corrupted versions of a known object in a way that does not affect the 
correct segmentation of other unfamiliar objects. A smoothness constraint on the 
labeling function induces the process to distinguish between occlusions (which 
are close to the familiar object) and separate independent objects (assumed to 
be sufficiently far from the object of interest). 

All of the approaches mentioned above were designed to segment a single 
known object in a given image. But what if there are several known objects? 
Clearly, any use of shape priors consistent with the philosophy of the level set 
method should retain the capacity of the resulting segmentation scheme to deal 
with multiple independent objects, no matter whether they are familiar or not. 
One may instead suggest to iteratively apply the segmentation scheme with a 
different prior at each time and thereby successively segment the respective ob- 
jects. We believe, however, that such a sequential processing mode will not scale 
up to large databases of objects and that - even more importantly - the paral- 
lel use of competing priors is essential for modeling the chicken-egg relationship 
between segmentation and recognition. 

In this paper, we adopt the selective shape prior approach suggested in [7] 
and substantially generalize it along several directions: 

— We extend the shape prior by pose parameters. The resulting segmentation 
process not only selects appropriate regions where to apply the prior, it 
also selects appropriate pose parameters associated with a given prior. This 
drastically increases the usefulness of this method for realistic segmentation 
problems, as one cannot expect to know the pose of the object beforehand. 

— We extend the previous approach which allowed one known shape in a scene 
of otherwise unfamiliar shapes to one which allows two different known 
shapes. Rather than treating the second shape as background, the segmen- 
tation scheme is capable of reconstructing both known objects. 

— Finally we treat the general case of an arbitrary number of known and un- 
known shapes by replacing the scalar-valued labeling by a vector-valued 
function. The latter permits the characterization of up to $2^n$ regions with different 
priors, where n is the dimension of the labeling function. In particular, 
we demonstrate that - through a process of competing priors - the result- 
ing segmentation scheme permits to simultaneously reconstruct three known 
objects while not affecting the segmentation of separate unknown objects. 
In this work, the term shape prior refers to fixed templates with variable 
2D pose. However, the proposed framework of selective shape priors is easily 
extended to statistical shape models which would additionally allow certain de- 
formation modes of each template. For promising advances regarding level set 
based statistical shape representations, we refer to [4]. 

The outline of the paper is as follows: In Section 2, we briefly review the level 
set formulation of the piecewise constant Mumford-Shah functional proposed in 
[3]. In Section 3, we augment this variational framework by a labeling function 
which selectively imposes a given shape prior in a certain image region. In Section 
4, we enhance this prior by explicit pose parameters and demonstrate the effect 
of simultaneous pose optimization. In Section 5, we extend the labeling approach 
from the case of one known object and background to that of two independent 
known objects. In Section 6, we come to the central contribution of this work, 
namely the generalization to an arbitrary number of known and unknown objects 
by means of a vector-valued labeling function. We demonstrate that the resulting 
segmentation scheme is capable of reconstructing corrupted versions of multiple 
known objects displayed in a scene containing other unknown objects. 



2 Data-Driven Level Set Segmentation 

Level set representations of moving interfaces, introduced by Osher and Sethian 
[13], have become a popular framework for image segmentation. A contour C is 
represented as the zero level set of an embedding function $\phi: \Omega \to \mathbb{R}$ on the 
image domain $\Omega \subset \mathbb{R}^2$: 

$$C = \{ x \in \Omega \mid \phi(x) = 0 \}. \qquad (1)$$

During the segmentation process, this contour is propagated implicitly by evolving 
the embedding function $\phi$. In contrast to explicit parameterizations, one 
avoids the issues of control point regridding. Moreover, the implicitly represented 
contour can undergo topological changes such as splitting and merging during 
the evolution of the embedding function. This makes the level set formalism well 
suited for the segmentation of multiple objects. In this work, we will revert to a 
region-based level set scheme introduced by Chan and Vese [3]. However, other 
data-driven level set schemes could be employed. 

In [3], Chan and Vese introduce a level set formulation of the piecewise constant 
Mumford-Shah functional [12]. In particular, they propose to generate a 
segmentation of an input image $f$ with two gray values $\mu_1$ and $\mu_2$ by minimizing 
the functional 

$$E_{cv}(\mu_1, \mu_2, \phi) = \int_\Omega \left[ (f - \mu_1)^2 H(\phi) + (f - \mu_2)^2 (1 - H(\phi)) \right] dx + \nu \int_\Omega |\nabla H(\phi)|\,dx \qquad (2)$$

with respect to the scalar variables $\mu_1$ and $\mu_2$ and the embedding level set 
function $\phi$. Here $H$ denotes the Heaviside function 

$$H(\phi) = \begin{cases} 1, & \phi \ge 0 \\ 0, & \text{else.} \end{cases} \qquad (3)$$

The last term in (2) measures the length of the zero-crossing of $\phi$. 

Fig. 1. Purely intensity-based segmentation. Contour evolution generated by 
minimizing the Chan-Vese model (2) [3]. The central figure is partially corrupted. 

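The Euler-Lagrange equation for this functional is implemented by gradient 
descent: 

$$\frac{\partial \phi}{\partial t} = \delta(\phi) \left[ \nu\, \mathrm{div}\!\left( \frac{\nabla \phi}{|\nabla \phi|} \right) - (f - \mu_1)^2 + (f - \mu_2)^2 \right], \qquad (4)$$

where $\mu_1$ and $\mu_2$ are updated in alternation with the level set evolution to take 
on the mean gray value of the input image $f$ in the regions defined by $\phi \ge 0$ 
and $\phi < 0$, respectively: 

$$\mu_1 = \frac{\int f(x)\, H(\phi)\,dx}{\int H(\phi)\,dx}, \qquad \mu_2 = \frac{\int f(x)\,(1 - H(\phi))\,dx}{\int (1 - H(\phi))\,dx}. \qquad (5)$$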

Figure 1 shows a representative contour evolution obtained for an image 
containing three figures, the middle one being partially corrupted. 
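
The alternating update of the two gray values can be sketched on a small pixel grid. The example below uses a sharp Heaviside function, unit pixel area, and hypothetical function names `region_means` and `fidelity`; it only covers the mean update and the data term of the functional, not the level set evolution itself.

```python
def region_means(f, phi):
    """Mean of f where phi >= 0 (mu1) and where phi < 0 (mu2)."""
    s1 = s2 = 0.0
    n1 = n2 = 0
    for f_row, phi_row in zip(f, phi):
        for fv, pv in zip(f_row, phi_row):
            if pv >= 0:
                s1 += fv
                n1 += 1
            else:
                s2 += fv
                n2 += 1
    return s1 / n1, s2 / n2  # assumes that both regions are non-empty


def fidelity(f, phi, mu1, mu2):
    """Data term: sum of (f - mu1)^2 H(phi) + (f - mu2)^2 (1 - H(phi)) over pixels."""
    return sum(
        (fv - (mu1 if pv >= 0 else mu2)) ** 2
        for f_row, phi_row in zip(f, phi)
        for fv, pv in zip(f_row, phi_row)
    )


f = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.2, 0.8, 1.0]]        # toy image
phi = [[-1.0, -1.0, 1.0, 1.0], [-1.0, -1.0, 1.0, 1.0]]  # toy embedding function
mu1, mu2 = region_means(f, phi)
```

For a fixed embedding function, the region means minimize the data term over all pairs of constants, which is why they are used in the alternation.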



3 Selective Shape Priors by Dynamic Labeling 

The evolution in Figure 1 demonstrates the well-known fact that the level set 
based segmentation process can cope with multiple objects in a given scene. How- 
ever, if the low-level segmentation criterion is violated due to noise, background 
clutter or partial occlusion of the objects of interest, then the purely image-based 
segmentation scheme will fail to converge to the desired segmentation. 

To cope with such degraded low-level information, it was proposed to intro- 
duce prior shape knowledge into the level set scheme (cf. [9,17,15]). The basic 
idea is to extend the image-based cost functional by a shape energy which favors 
certain contour formations: 



$$E_{total}(\phi) = E_{cv}(\mu_1, \mu_2, \phi) + \alpha\, E_{shape}(\phi). \qquad (6)$$
Fig. 2. Global shape prior. Contour evolution generated by minimizing the total energy 
(6) with a global shape prior of the form (7) encoding the figure in the center. Due 
to the global constraint on the embedding function, the familiar object is reconstructed 
while all unfamiliar structures are suppressed in the final segmentation. The resulting 
segmentation scheme lost its capacity to deal with multiple independent objects. 



In general, the proposed shape constraints affect the embedding surface $\phi$ 
globally (i.e. on the entire domain $\Omega$). In the simplest case, such a prior has the 
form: 

$$E_{shape}(\phi) = \int_\Omega (\phi(x) - \phi_0(x))^2\,dx, \qquad (7)$$

where $\phi_0$ is the level set function embedding a given training shape (or the mean 
of a set of training shapes). Uniqueness of the embedding function associated 
with a given shape is guaranteed by imposing $\phi_0$ to be a signed distance function 
(cf. [9]). For consistency, we also project the segmenting level set function $\phi$ to 
the space of distance functions during the optimization [16]. 

Figure 2 shows several steps in the contour evolution with such a prior, where 
$\phi_0$ is the level set function associated with the middle figure. The shape prior 
permits the reconstruction of the object of interest, yet in the process, all unfamiliar 
objects are suppressed from the segmentation. The segmentation process with 
shape prior has thus lost its capacity to handle multiple (independent) objects. 

In order to retain this favorable property of the level set method, it was 
proposed in [7] to introduce a labeling function L : 17 — >■ M, which indicates the 
regions of the image where a given prior is to be enforced. During optimization 
of an appropriate cost functional, the labeling evolves dynamically in order to 
select these regions in a recognition-driven way. The corresponding shape energy 
is given by: 

E,hapei4>,L) = Jicj)-Pof{L + lfdx + j\^ {L-lfdx + -ij\VH{L)\dx, 

( 8 ) 

with two parameters $\lambda, \gamma > 0$. The labeling $L$ enforces the shape prior in those
areas of the image where the level set function is similar to the prior (associated
with labeling $L = 1$). In particular, for fixed $\phi$, minimizing the first two terms
in (8) induces the following qualitative behavior of the labeling:

$$L \to +1, \quad \text{if } |\phi - \phi_0| < \lambda,$$

$$L \to -1, \quad \text{if } |\phi - \phi_0| > \lambda.$$
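
The threshold behavior can be checked pointwise. Writing $d = \phi - \phi_0$ at a single pixel, the first two integrands of (8) are $d^2 (L+1)^2 + \lambda^2 (L-1)^2$, and setting the derivative with respect to $L$ to zero gives $L = (\lambda^2 - d^2) / (\lambda^2 + d^2)$. The sketch below evaluates this closed form; the function name `pointwise_label` is hypothetical and the length regularization is ignored.

```python
def pointwise_label(diff, lam):
    """Minimizer over L of diff^2 (L+1)^2 + lam^2 (L-1)^2 at one pixel (lam > 0)."""
    d2, l2 = diff * diff, lam * lam
    return (l2 - d2) / (l2 + d2)

near_prior = pointwise_label(0.5, 1.0)   # |phi - phi0| < lambda
at_threshold = pointwise_label(1.0, 1.0)
far_from_prior = pointwise_label(2.0, 1.0)  # |phi - phi0| > lambda
```

The unconstrained minimizer stays between -1 and +1, so only its sign reflects the qualitative rule: positive where the level set function is closer to the prior than $\lambda$, negative where it is farther away.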

[Figure 3 panels: evolution of the segmenting contour (top row); the simultaneously evolving labeling function (middle row); zero-crossing of the labeling function superimposed on the input image (bottom row).]

Fig. 3. Selective shape prior by dynamic labeling. Contour evolution generated 
by minimizing the total energy (6) with a selective shape prior of the form (8) encoding 
the figure in the center. Due to the simultaneous optimization of a labeling function 
$L(x)$ (middle and bottom row), the shape prior is restricted to act only in selected
areas. The familiar shape is reconstructed, while the correct segmentation of separate 
(unfamiliar) objects remains unaffected. The resulting segmentation scheme thereby 
retains its capacity to deal with multiple independent objects. In this and all subsequent 
examples, labeling functions are initialized by L = 0. 



In addition, the last term in equation (8) imposes a regularizing constraint on the
length of the zero crossing of the labeling. This induces topological “compactness”
of both the regions with and without shape prior.

Figure 3 shows the contour evolution generated with the prior (8), where $\phi_0$
encodes the middle figure as before. Again the shape prior makes it possible to reconstruct
the corrupted figure. In contrast to the global prior (7) in Figure 2, however, the
process dynamically selects the region in which to impose the prior. Consequently,
the correct segmentation of the two unknown objects is unaffected by the prior.

4 A Pose-Invariant Formulation

In the above formalism of dynamic labeling, the pose of the object of interest is 
assumed to be known. In a realistic segmentation problem, one generally does 
not know the pose of an object of interest. If the object of interest is no longer 
in the same location as the prior $\phi_0$, the labeling approach will fail to generate
the desired segmentation. This is demonstrated in Figure 4. While the labeling 

Fig. 4. Missing pose optimization. Evolution of contour (yellow) and labeling (blue)
with selective shape prior (8) and a displaced template $\phi_0$. Without simultaneous pose
optimization, the familiar shape is forced to appear in the displaced position.




Fig. 5. Effect of pose optimization. By simultaneously optimizing a set of pose 
parameters in the shape energy (9), one jointly solves the problems of estimating the 
area where to impose a prior and the pose of the respective prior. Note that the pose 
estimate is gradually improved during the energy minimization. 

still separates areas of known objects from areas of unknown objects, the known 
shape is not reconstructed correctly, since the pose of the prior and that of the 
object in the image differ. 

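A possible solution is to introduce a set of pose parameters associated with
a given prior $\phi_0$ (cf. [15,5]). The corresponding shape energy

$$E_{shape}(\phi, L, s, \theta, h) = \int \left( \phi(x) - \tfrac{1}{s} \phi_0(s R_\theta x + h) \right)^2 (L+1)^2 \, dx + \int \lambda^2 (L-1)^2 \, dx + \gamma \int |\nabla H(L)| \, dx \qquad (9)$$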

is simultaneously optimized with respect to the segmenting level set function $\phi$,
the labeling function $L$ and the pose parameters, which account for translation
$h$, rotation by an angle $\theta$ and scaling $s$ of the template. The normalization by $s$
guarantees that the resulting shape remains a distance function.
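
The role of the normalization by $s$ can be illustrated numerically: a distance function has unit gradient norm, and the factor $1/s$ compensates for the scaling of the argument. The sketch below assumes a circular template and uses central differences; the helper names `transformed_prior` and `gradient_norm` are hypothetical.

```python
import math

def circle_sdf(x, y, radius=1.0):
    """Signed distance to a circle centered at the origin."""
    return math.hypot(x, y) - radius

def transformed_prior(phi0, s, theta, h):
    """Return x -> (1/s) * phi0(s * R_theta x + h), the posed template of eq. (9)."""
    c, sn = math.cos(theta), math.sin(theta)
    def phi(x, y):
        u = s * (c * x - sn * y) + h[0]
        v = s * (sn * x + c * y) + h[1]
        return phi0(u, v) / s
    return phi

def gradient_norm(phi, x, y, eps=1e-6):
    """Central-difference estimate of |grad phi| at (x, y)."""
    gx = (phi(x + eps, y) - phi(x - eps, y)) / (2 * eps)
    gy = (phi(x, y + eps) - phi(x, y - eps)) / (2 * eps)
    return math.hypot(gx, gy)

posed = transformed_prior(circle_sdf, s=2.0, theta=0.7, h=(0.4, -0.2))
unnormalized = lambda x, y: circle_sdf(2.0 * x, 2.0 * y)  # scaling without 1/s
```

With the factor $1/s$ the gradient norm stays at 1, whereas the unnormalized version has gradient norm $s$ and is therefore no longer a distance function.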

Figure 5 shows the resulting segmentation: again the labeling selects the regions
in which to apply the given prior, but now the simultaneous pose optimization
also makes it possible to estimate the pose of the object of interest.

The main focus of the present paper is to propose selective shape priors. To
simplify the exposition, we will therefore assume in the following
that the correct pose of familiar objects is known. Moreover, we will drop the pose
parameters associated with each shape template from the equations, so as to
simplify the notation. We want to stress, however, that similar pose invariance
can be demonstrated for all of the following generalizations.

5 Extension to Two Known Objects 

A serious limitation of the labeling approach in (8) is that it only allows for a 
single known object (and multiple unknown objects). What if there are several 
familiar objects in the scene? How can one integrate prior knowledge about 
multiple shapes such as those given by a database of known objects? Before 
considering the general case, let us first study the case of two known objects. 

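The following modification of (8) allows for two different familiar objects
associated with embedding functions $\phi_1$ and $\phi_2$:

$$E_{shape}(\phi, L) = \frac{1}{\sigma_1^2} \int (\phi - \phi_1)^2 (L+1)^2 \, dx + \frac{1}{\sigma_2^2} \int (\phi - \phi_2)^2 (L-1)^2 \, dx + \gamma \int |\nabla H(L)| \, dx. \qquad (10)$$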

The terms associated with the two objects were normalized with respect to the
variance of the respective template: $\sigma_i^2 = \int \phi_i^2 \, dx - \left( \int \phi_i \, dx \right)^2$. The resulting shape
prior therefore has merely one (instead of two) free parameter. The evolution of
the labeling function is now driven by two competing shape priors: each image
location will be ascribed to one or the other prior.
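
A small sketch of the variance normalization, assuming a domain of unit area so that $\int \phi_i \, dx$ is the mean of the template. The grid size, the circular template and the name `template_variance` are illustrative choices.

```python
import math

def template_variance(phi, n=64):
    """Discrete sigma^2 = int phi^2 dx - (int phi dx)^2 on the unit square."""
    h = 1.0 / n
    first = second = 0.0
    for i in range(n):
        for j in range(n):
            value = phi((i + 0.5) * h, (j + 0.5) * h)
            first += value * h * h
            second += value * value * h * h
    return second - first * first

template = lambda x, y: math.hypot(x - 0.5, y - 0.5) - 0.2
doubled = lambda x, y: 2.0 * template(x, y)

sigma2 = template_variance(template)
sigma2_doubled = template_variance(doubled)
```

Doubling the values of a template multiplies its variance by four. Dividing the squared deviation by $\sigma_i^2$, as in (10), therefore puts the two competing terms on a comparable scale.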

Figure 6 shows a comparison: The upper row indicates the contour evolution
generated with the shape energy (8), where $\phi_0$ encodes the figure on the left.
The lower row shows the respective evolution obtained with the shape energy
(10), with $\phi_1$ and $\phi_2$ encoding the left and right figures, respectively. Whereas the
object on the right (occluded by a pen) is treated as unknown in the original
formulation (upper row), both figures can be reconstructed by simultaneously
imposing two competing priors in different domains (lower row).



6 The General Case: Multiphase Dynamic Labeling 

The above example showed that the dynamic labeling approach can be trans- 
formed to allow for two shape priors rather than a single shape prior and possible 
background. 

Let us now consider the general case of a larger number of known objects and
possibly some further independent unknown objects (which should therefore be
segmented based on their intensity only). To this end, we introduce a vector-valued
labeling function

$$L(x) = (L_1(x), \ldots, L_n(x)). \qquad (11)$$

We employ the $m = 2^n$ vertices of the polytope $[-1,+1]^n$ to encode $m$ different
regions, $L_j \in \{+1, -1\}$, and denote by $\chi_i$, $i = 1, \ldots, m$, the indicator function for
each of these regions. See [19] for a related concept in the context of multi-region

[Figure 6 panels: dynamic labeling with a single prior and background (top row); dynamic labeling allowing for two competing priors (bottom row).]

Fig. 6. Extension to two priors. Evolutions of contour (yellow) and labeling (blue) 
generated by minimizing energy (6) with a selective prior of the form (8) encoding the 
left figure (top) and with a selective prior of the form (10) encoding both figures 
(bottom). In both cases, the left figure is correctly reconstructed despite prominent 
occlusions by the scissors. However, while the structure on the right is treated as 
unfamiliar and thereby segmented based on intensities only (top row), the extension to 
two priors makes it possible to reconstruct both known objects simultaneously (bottom row).



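segmentation. For example, for $n = 2$, four regions are modeled by the indicator
functions:

$$\chi_1(L) = \tfrac{1}{16} (L_1 - 1)^2 (L_2 - 1)^2, \qquad \chi_2(L) = \tfrac{1}{16} (L_1 + 1)^2 (L_2 - 1)^2,$$

$$\chi_3(L) = \tfrac{1}{16} (L_1 - 1)^2 (L_2 + 1)^2, \qquad \chi_4(L) = \tfrac{1}{16} (L_1 + 1)^2 (L_2 + 1)^2.$$

In the general case of an n-dimensional labeling function, each indicator function
will be of the form

$$\chi_i(L) = \chi_{l_1 \ldots l_n}(L) = \frac{1}{4^n} \prod_{j=1}^{n} (L_j + l_j)^2, \quad \text{with } l_j \in \{+1, -1\}. \qquad (12)$$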
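
The indicator functions of (12) can be tabulated for small $n$ to confirm that each one equals 1 at its own vertex of the polytope and 0 at every other vertex. The sketch below assumes the normalization $1/4^n$ and uses the illustrative names `indicator` and `vertices`.

```python
from itertools import product

def indicator(labels, vertex):
    """chi for one vertex (l_1, ..., l_n) in {+1, -1}^n, following eq. (12)."""
    value = 1.0
    for L_j, l_j in zip(labels, vertex):
        value *= (L_j + l_j) ** 2 / 4.0
    return value

n = 2
vertices = list(product((-1, 1), repeat=n))
# Row v lists chi_v evaluated at every vertex w of the polytope.
table = {v: [indicator(w, v) for w in vertices] for v in vertices}
# An undecided labeling L = (0, 0) gives every region the same small weight.
undecided = [indicator((0.0,) * n, v) for v in vertices]
```

For label values strictly between -1 and +1 the functions are smooth and no longer binary, which is what allows the labeling to evolve by gradient descent.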

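With this notation, the extension of the dynamic labeling approach to up to
$m = 2^n$ regions can be cast into a cost functional of the form:

$$E_{total}(\phi, L, \mu_1, \mu_2) = E_{cv}(\phi, \mu_1, \mu_2) + \alpha \, E_{shape}(\phi, L),$$

$$E_{shape}(\phi, L) = \sum_{i=1}^{m-1} \int \frac{(\phi - \phi_i)^2}{\sigma_i^2} \chi_i(L) \, dx + \int \lambda^2 \chi_m(L) \, dx + \gamma \sum_{i=1}^{n} \int |\nabla H(L_i)| \, dx. \qquad (13)$$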
Here, each $\phi_i$ corresponds to a particular known shape with its variance given
by $\sigma_i^2$.

As mentioned before, we have, for better readability, neglected the pose
parameters associated with each template. These can be incorporated by the
replacements:

$$\phi_i(x) \to \tfrac{1}{s_i} \phi_i(s_i R_{\theta_i} x + h_i) \quad \text{and} \quad E_{shape}(\phi, L) \to E_{shape}(\phi, L, p),$$

where $p = (p_1, \ldots, p_m)$ denotes the vector of pose parameters $p_i = (s_i, \theta_i, h_i)$
associated with each known shape.



7 Energy Minimization 



In the previous sections, we have introduced variational formulations of increas- 
ing complexity to tackle the problem of multi-object segmentation with shape 
priors. The corresponding segmentation processes are generated by minimizing 
these functionals. In this section, we will detail the minimization scheme in or- 
der to illuminate how the different components of the proposed cost functionals 
affect the segmentation process. Let us focus on the case of multiple labels corresponding
to the cost functional (13). Minimization of this functional is obtained
by alternating the update of the mean intensities $\mu_1$ and $\mu_2$ according to (5) with
a gradient descent evolution for the level set function $\phi$, the labeling functions
$L_j$ and the associated pose parameters $p_j$. In the following, we will detail this
for $\phi$ and $L_j$. The respective evolution equations for $p_j$ are straightforward and not
our central focus.

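For fixed labeling, the evolution of the level set function $\phi$ is given by:

$$\frac{\partial \phi}{\partial t} = -\frac{\partial E_{total}}{\partial \phi} = -\frac{\partial E_{cv}}{\partial \phi} - 2\alpha \sum_{i=1}^{m-1} \frac{\phi - \phi_i}{\sigma_i^2} \chi_i(L). \qquad (14)$$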



Apart from the image-driven first component given by the Chan-Vese evolution
in equation (4), we additionally have a relaxation toward the template $\phi_i$ in all
image locations where $\chi_i = 1$.
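
The relaxation toward the selected template can be sketched at a single pixel with an explicit Euler discretization of the shape term in (14). The image term is omitted here, and the step size, the parameter values and the name `shape_force_step` are assumptions for illustration.

```python
def shape_force_step(phi, templates, variances, chis, alpha, dt):
    """One explicit Euler step of the shape term of eq. (14) at one pixel."""
    force = sum((phi - t) / v * c for t, v, c in zip(templates, variances, chis))
    return phi - dt * 2.0 * alpha * force

phi = 3.0
for _ in range(200):
    # The labeling selects the first template (chi_1 = 1, chi_2 = 0).
    phi = shape_force_step(phi, templates=[1.0, -2.0], variances=[1.0, 1.0],
                           chis=[1.0, 0.0], alpha=1.0, dt=0.05)
```

The value of the level set function converges to the value of the template whose indicator function is 1, and the template with indicator 0 has no influence.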

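Minimization by gradient descent with respect to the labeling functions $L_j$
corresponds to an evolution of the form:

$$\frac{1}{\alpha} \frac{\partial L_j}{\partial t} = -\sum_{i=1}^{m-1} \frac{(\phi - \phi_i)^2}{\sigma_i^2} \frac{\partial \chi_i(L)}{\partial L_j} - \lambda^2 \frac{\partial \chi_m(L)}{\partial L_j} + \gamma \, \delta(L_j) \, \mathrm{div}\left( \frac{\nabla L_j}{|\nabla L_j|} \right), \qquad (15)$$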



where the derivatives of the indicator functions $\chi_i$ are easily obtained from (12).
The first two terms in (15) drive the labeling $L$ to indicate the template $\phi_i$ which
is most similar to the given function $\phi$ (or alternatively the background). The
last term minimizes the length of the zero crossing of $L_j$. This has two effects:
Firstly, it induces the labeling to decide for one of the possible templates (or the
background), i.e. mixtures of templates with label values between +1 and −1 are

[Figure 7 panels: evolution of the segmentation with multiphase dynamic labeling (top row); labeling 1 and labeling 2 (bottom rows).]

Fig. 7. Coping with several objects by multiphase dynamic labeling. Contour 
evolution generated by minimizing the total energy (6) with a multiphase selective 
shape prior of the form (13) encoding the three figures on the left, center and right. 
The appearance of all three objects is corrupted. Due to the simultaneous optimization 
of a vector-valued labeling function, several regions associated with each shape prior 
are selected, in which the given prior is enforced. All familiar shapes are segmented 
and restored, while the correct segmentation of separate (unfamiliar) objects remains 
unaffected. The images on the bottom show the final labeling and - for comparison - 
the segmentation without prior (right). 



suppressed. Secondly, it enforces the decision regions (regions of constant label) 
to be “compact”, because label flipping is energetically unfavorable. 

Figure 7 shows a contour evolution obtained with the multiphase dynamic
labeling model (13) and $n = 2$ labeling functions. The image contains three corrupted
objects which are assumed to be familiar and one unfamiliar object (in
the top left corner). The top row shows the evolution of the segmenting contour
(yellow) superimposed on the input image. The segmentation process with
a vector-valued labeling function selects regions corresponding to the different
objects in an unsupervised manner and simultaneously applies three competing
shape priors, which make it possible to reconstruct the familiar objects. Corresponding 3D
plots of the two labeling functions in the bottom rows of Figure 7 show which
areas of the image have been associated with which label configuration. For example,
the object in the center has been identified by the labeling $L = (+1, -1)$.

8 Conclusion 

We introduced the framework of multiphase dynamic labeling, which allows
multiple competing shape priors to be integrated into level set based segmentation
schemes. The proposed cost functional is simultaneously optimized with respect
to a level set function defining the segmentation, a vector-valued labeling function
indicating regions where particular shape priors should be enforced, and a
set of pose parameters associated with each prior. Each shape prior is given by
a fixed template and respective pose parameters, yet the extension to statistical
shape priors (which additionally allow deformation modes) is straightforward.

We argued that the proposed mechanism fundamentally generalizes previous 
approaches to shape priors in level set segmentation. Firstly, it is consistent 
with the philosophy of level sets because it retains the capacity of the resulting 
segmentation scheme to cope with multiple independent objects in a given image. 
Secondly, it addresses the central question of where to apply which shape prior. 

The selection of appropriate regions associated with each prior is generated 
by the dynamic labeling in a recognition-driven manner. In this sense, our work 
demonstrates in a specific way how a recognition process can be modeled in a 
variational segmentation framework. 
Abstract. For shapes represented as closed planar contours, we introduce
a class of functionals that are invariant with respect to the Euclidean
and similarity group, obtained by performing integral operations.
While such integral invariants enjoy some of the desirable properties 
of their differential cousins, such as locality of computation (which al- 
lows matching under occlusions) and uniqueness of representation (in 
the limit), they are not as sensitive to noise in the data. We exploit the 
integral invariants to define a unique signature, from which the original 
shape can be reconstructed uniquely up to the symmetry group, and a 
notion of scale-space that allows analysis at multiple levels of resolution. 
The invariant signature can be used as a basis to define various notions 
of distance between shapes, and we illustrate the potential of the integral 
invariant representation for shape matching on real and synthetic data. 



1 Introduction 

Geometric invariance is an important issue in computer vision that has received 
considerable attention in the past. The idea that one could compute functions 
of geometric primitives of the image that do not change under the various nui- 
sances of image formation and viewing geometry was appealing; it held potential 
for application to recognition, correspondence, 3-D reconstruction, and visual- 
ization. The discovery that there exist no generic viewpoint invariants was only 
a minor roadblock, as image deformations can be approximated with homo- 
graphies; hence the study of invariants to projective transformations and their 
subgroups (affine, similarity, Euclidean) flourished. Toward the end of the last
millennium, the decrease in popularity of research on geometric invariance was 
sanctioned mostly by two factors: the progress on multiple view geometry (one 
way to achieve viewpoint invariance is to estimate the viewing geometry) and 
noise. Ultimately, algorithms based on invariants did not meet expectations be- 
cause most entailed computing various derivatives of measured functions of the 
image (hence the name “differential invariants”). As soon as noise was present 
and affected the geometric primitives computed from the images, the invariants 
were dominated by the small scale perturbations. Various palliative measures 
were taken, such as the introduction of scale-space smoothing, but a more principled
approach has so far been elusive. Nowadays, the field is instead engaged
in searching for invariant (or insensitive) measures of photometric (rather than 
geometric) nuisances in the image formation process. Nevertheless, the idea of 
computing functions that are invariant with respect to group transformations of 
the image domain remains important, because it holds the promise to extract 
compact, efficient representations for shape matching, indexing, and ultimately 
recognition. 

In this paper, we introduce a general class of invariants that are integral 
functionals of the data, as opposed to differential ones. We argue that such 
functionals are far less sensitive to noise, while retaining the nice features of dif- 
ferential invariants such as locality, which allow for matching under occlusions. 
They can be exploited to define invariant signature curves that can be used as 
a representation to define various notions of distances between shapes. We re- 
strict our analysis to Euclidean and similarity invariants, although extensions to 
the affine group are straightforward. The integration kernel allows us to define 
intrinsic scale-spaces of invariant signatures, so that we can represent shapes 
at different levels of resolution and under various levels of measurement noise. 
We also show that our invariants can be computed very efficiently without per- 
forming explicit sums (in the discretized domain). Finally, we show that in the 
limit where the kernel measure goes to zero, one class of integral invariant is in 
one-to-one correspondence with the prince of differential invariants, curvature. 
This allows the establishment of a completeness property of the representation, 
in the limit, in that a given shape can be reconstructed uniquely, up to the in- 
variance group, from its invariant signature. This relationship allows us to tap 
into the rich literature on differential invariants for theoretical results, while in 
our experiments we can avoid computing higher-order derivatives. We illustrate 
our results with several experiments, shown as space allows.

2 Relation to Existing Work, and Our Contribution

The role of invariants in computer vision has been advocated for various applica- 
tions ranging from shape representation [34,4] to shape matching [3,29], quality 
control [48,11], and general object recognition [39,1]. Consequently a number of 
features that are invariant under specific transformations have been investigated 
[14,25,15,21,33,46]. In particular, one can construct primitive invariants of al- 
gebraic entities such as lines, conics and polynomial curves, based on a global 
descriptor of shape [36,18]. In addition to invariants to transformation groups, 
considerable attention has been devoted to invariants with respect to the geomet- 
ric relationship between 3D objects and their 2D views; while generic viewpoint 
invariants do not exist, invariant features can be computed from a collection 
of coplanar points or lines [40,41,20,6,17,52,1,45,26]. An invariant descriptor of 
a collection of points that relates to our approach is the shape context intro- 
duced by Belongie et al. [3], which consists in a radial histogram of the relative 
coordinates of the rest of the shape at each point. 

Differential invariants to actions of various Lie groups have been addressed 
thoroughly [28,24,13,35]. An invariant is defined by an unchanged subset of the 
manifold which the group transformation is acting on. In particular, an invariant
signature which pairs curvature and its first derivative avoids parameterization 
in terms of arc length [10,37]. Calabi and coworkers suggested numerical expres- 
sions for curvature and first derivative of curvature in terms of joint invariants. 
However, it is shown that the expression for the first derivative of curvature is 
not convergent and modified formulas are presented in [5]. 

In order to reduce noise-induced fluctuations of the signature, semi-differential
invariant methods were introduced, which use first derivatives and one
reference point instead of curvature, thus avoiding the computation of high-order
derivatives [38,19,27]. Another semi-invariant is given by transforming the given
coordinate system to a canonical one [49].

A useful property of differential and (some) semi-differential invariants is 
that they can be applied to match shapes despite occlusions, due to the locality 
of the signature [8,7]. However, the fundamental problem of differential invari- 
ants is that high-order derivatives have to be computed, amplifying the effect of 
noise. There have been several approaches to decrease sensitivity to noise by employing
scale-space via linear filtering [50]. The combination of invariant theory
with geometric multiscale analysis is investigated by applying an invariant diffu- 
sion equation for curve evolution [42,43,12]. A scale-space can be determined by 
varying the size of the differencing interval used to approximate derivatives using 
finite differences [9]. In [32], a curvature scale-space was developed for a shape 
matching problem. A set of Gaussian kernels was applied to build a scale-space 
of curvature whose extrema were observed across scales. 

To overcome the limitations of differential invariants, there have been at- 
tempts to derive invariants based on integral computations. A statistical ap- 
proach to describe invariants was introduced using moments in [23]. Moment 
invariants under affine transformations were derived from the classical moment 
invariants in [16]. They have a limitation in that high-order moments are sen- 
sitive to noise which results in high variances. The error analysis and analytic 
characterization of moment descriptors were studied in [30]. The Fourier transform
was also applied to obtain integral invariants [51,31,2]. A closed curve was
represented by a set of Fourier coefficients and normalized Fourier descriptors 
were used to compute affine invariants. In this method, high-order Fourier co- 
efficients are involved and they are not stable with respect to noise. Several 
techniques have been developed to restrict the computation to local neighbor- 
hoods: the Wavelet transform was used for affine invariants using the dyadic 
wavelet in [47] and potentials were also proposed to preserve locality [22]. Alter- 
natively, semi-local integral invariants are presented by integrating object curves 
with respect to arc length [44]. 

In this manuscript, we introduce two general classes of integral invariants; 
for one of them, we show its relationship to differential invariants (in the limit), 
which allows us to conclude that the invariant signature curve obtained from 
the integral invariant is in one-to-one correspondence with the original shape, 
up to the action of the nuisance group. We use the invariant signature to define 
various notions of distance between shapes, and we illustrate the potential of 
our representation on several experiments with real and simulated images. 
3 Integral Invariants



Throughout this section we indicate with $\gamma : S^1 \to \mathbb{R}^2$ a closed planar contour
with arclength $ds$, and $G$ a group acting on $\mathbb{R}^2$, with $dx$ the area form on $\mathbb{R}^2$. We
also use the formal notation $\bar{\gamma}$ to indicate either the interior of the region bounded
by $\gamma$ (a two-dimensional object), or the curve $\gamma$ itself (a one-dimensional object),
and $d\mu(x)$ the corresponding measure, i.e. the area form $dx$ or the arclength $ds(x)$
respectively. With this notation, we can define a fairly general notion of integral
invariant.

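Definition 1. A function $I_\gamma(p) : \mathbb{R}^2 \to \mathbb{R}$ is an integral G-invariant if there
exists a kernel $h : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$ such that

$$I_\gamma(p) = \int_{\bar{\gamma}} h(p, x) \, d\mu(x), \qquad (1)$$

where $h(\cdot,\cdot)$ satisfies

$$\int_{\bar{\gamma}} h(p, x) \, d\mu(x) = \int_{g\bar{\gamma}} h(gp, x) \, d\mu(x) \quad \forall \, g \in G, \qquad (2)$$

where $g\gamma = \{ gx \mid g \in G, x \in \gamma \}$, and similarly for $g\bar{\gamma}$.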

The definition can be extended to vector signatures, or to multiple integrals.
Note that the point $p$ does not necessarily lie on the contour $\gamma$, as long as there
is an unequivocal way of associating $p \in \mathbb{R}^2$ to $\gamma$ (e.g. the centroid of the curve).



Example 1 (Integral distance invariant). Consider $G = SE(2)$ and the
following function, computed at every point $p \in \gamma$:

$$I_\gamma(p) = \int_\gamma d(p, x) \, ds(x), \qquad (3)$$

where $d(x, y) = |y - x|$ is the Euclidean distance in $\mathbb{R}^2$. This is illustrated in
Fig. 1-a.




Fig. 1. (Left) Integral distance invariant defined in eq. (3), made local by means of a 
kernel as described in eq. (5). (Right) Integral area invariant defined by eq. (6). 




Integral Invariant Signatures 






It is immediate to show that this is an integral Euclidean invariant. The
function $I_\gamma$ associates to each point on the contour a number that is the average
distance from that point to every other point on the contour (up to normalization
by the curve length). In particular, if the
point $p \in \gamma$ is parameterized by arclength, the invariant can be interpreted as a
function from $[0, L]$, where $L$ is the length of the curve, to the positive reals:

$$\{ \gamma : S^1 \to \mathbb{R}^2 \} \to \{ I_\gamma(p(s)) : [0, L] \to \mathbb{R}_+ \}. \qquad (4)$$

This invariant is computed for a few representative shapes in Fig. 2 and Fig. 3. 
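
As a rough numerical illustration of eq. (3), the contour can be sampled as a closed polygon and the integral approximated with the midpoint rule on each edge. The discretization and the names `distance_invariant` and `rigid_motion` are assumptions of this sketch.

```python
import math

def distance_invariant(points):
    """Approximate I(p_k) = int |p_k - x| ds(x) at each vertex of a closed polygon."""
    n = len(points)
    values = []
    for px, py in points:
        total = 0.0
        for k in range(n):
            (ax, ay), (bx, by) = points[k], points[(k + 1) % n]
            ds = math.hypot(bx - ax, by - ay)           # arclength element
            mx, my = 0.5 * (ax + bx), 0.5 * (ay + by)   # midpoint of the edge
            total += math.hypot(mx - px, my - py) * ds
        values.append(total)
    return values

def rigid_motion(points, theta, tx, ty):
    """Apply a rotation by theta followed by a translation (an element of SE(2))."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

ellipse = [(2.0 * math.cos(2 * math.pi * k / 200), math.sin(2 * math.pi * k / 200))
           for k in range(200)]
before = distance_invariant(ellipse)
after = distance_invariant(rigid_motion(ellipse, 0.9, 3.0, -1.0))
```

The values before and after the rigid motion agree up to rounding, as expected for a Euclidean invariant. For a circle of radius $R$ the integral evaluates to $8R^2$ at every point, which gives a simple sanity check of the discretization.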

A more “local” version of the invariant signature can be obtained by weighting
the integral in eq. (3) with a kernel $q(p, x)$, so that $I_\gamma(p) = \int_\gamma h(p, x) \, ds(x)$, where

$$h(p, x) = q(p, x) \, d(p, x). \qquad (5)$$

The kernel $q(\cdot,\cdot)$ is free for the designer to choose depending on the final goal.
This local integral invariant can be thought of as a continuous version of the
“shape context”, which was designed for a finite collection of points [3]. The difference
is that the shape context signature is a local radial histogram of neighboring
points, whereas in our case we only store the mean of their distance.



Example 2 (Integral area invariant). Consider now the kernel
$h(p, x) = \chi(B_r(p) \cap \bar{\gamma})(x)$, which represents the indicator function of the intersection
of a small circle of radius $r$ centered at the point $p$ with the interior of
the curve $\gamma$. For any given radius $r$, the corresponding integral invariant

$$I_\gamma^r(p) = \int_{B_r(p) \cap \bar{\gamma}} dx \qquad (6)$$

can be thought of as a function from the interval $[0, L]$ to the positive reals,
bounded above by the area of the region bounded by the curve $\gamma$. This is illustrated
in Fig. 1-b and examples are shown in Fig. 2 and Fig. 3.
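
A minimal sketch of eq. (6), assuming the shape interior is available as a membership test and estimating the area by counting grid samples inside the disk. The sampling resolution and the name `local_area` are illustrative assumptions.

```python
import math

def local_area(inside, p, r, n=400):
    """Estimate the area of B_r(p) intersected with the interior on an n x n grid."""
    h = 2.0 * r / n
    count = 0
    for i in range(n):
        for j in range(n):
            x = p[0] - r + (i + 0.5) * h
            y = p[1] - r + (j + 0.5) * h
            if math.hypot(x - p[0], y - p[1]) <= r and inside(x, y):
                count += 1
    return count * h * h

half_plane = lambda x, y: y <= 0.0                 # straight boundary through the origin
unit_disk = lambda x, y: math.hypot(x, y) <= 1.0   # convex boundary of curvature 1

flat = local_area(half_plane, (0.0, 0.0), 0.2)
convex = local_area(unit_disk, (1.0, 0.0), 0.2)
```

At a point on a straight boundary the disk is cut in half, so the invariant is close to $\pi r^2 / 2$. At a point on a convex boundary less than half of the disk lies inside the shape, so the value is smaller.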

Naturally, if we plot the value of $I_\gamma^r(p(s))$ for all values of $s$ and $r$ ranging
from zero to a maximum radius so that the local kernel encloses the entire curve,
$B_r(p) \supset \gamma$, we can generate a graph of a function that can be interpreted as a
scale-space of integral invariants. Furthermore, $\chi(B_r(p))$ can be substituted by
a more general kernel, for instance a Gaussian centered at $p$ with $\sigma = r$.

Example 3 (Differential invariant). Note that a regularized version of cur- 
vature, or in general a curvature scale space, can be interpreted as an integral 
invariant, since regularized curvature is an algebraic function of the first- and 
second-regularized derivatives [32]. Therefore, integral invariants are more general,
but we will not exploit this added generality, since it is contrary to the spirit of
this manuscript, which is to avoid computing derivatives of the image
data, even if regularized.
Fig. 2. For a set of representative shapes (left column), we compute the distance integral
invariant of eq. (3) (middle left column) and the local area invariant of eq. (6) with
a kernel size $\sigma = 2$ (middle right column). Compare the results with curvature, shown
in the rightmost column.
[Figure 3 column titles: Noisy shape; Distance invariant; Local area invariant; Curvature.]



Fig. 3. For a noisy shape (left column), the distance invariant of eq. (3) with a kernel 
size of $\sigma = 30$ (middle left column), the local area invariant of eq. (6) with kernel size
r = 10 (middle right column) and the differential invariant, curvature (right column). 
As one can see, noise is amplified in the computation of derivatives necessary to extract 
curvature. 





Fig. 4. For a noisy shape (left), the local area invariant of eq. (6) as a function of kernel 
size induces a scale-space of responses. 
4 Relationship with Curvature and Local Differential Invariants



In this section we study the relationship between the local area invariant (6) and 
curvature. This is motivated by the fact that curvature is a complete invariant, 
in the sense that it allows the recovery of the original curve up to the action of 
the symmetry group. Furthermore, all differential invariants of any order on the 
plane are functions of curvature [49], and therefore linking our integral invariant
to curvature would allow us to tap into the rich body of results on differential
invariants without suffering from the shortcomings of computing high-order
derivatives of the data. 

We first assume that $\gamma$ is smooth, so that a notion of curvature is well-defined,
and the curve can be approximated locally by the osculating circle¹
$B_R(p)$ (Fig. 1-b). The invariant $I_\gamma^r(p)$ denotes the area of the intersection of a
circle $B_r(p)$ with the interior of $\gamma$, and it can be approximated to first order by
the area of the shaded sector in Fig. 1-b, i.e. $I_\gamma^r(p) \simeq r^2 \theta(p)$. Now, the angle $\theta$
can be computed as a function of $r$ and $R$ using the cosine law: $\cos \theta = r / 2R$,
and since curvature $\kappa$ is the inverse of $R$ we have

$$I_\gamma^r(p) \simeq r^2 \arccos\left( \tfrac{1}{2} r \kappa(p) \right). \qquad (7)$$



Now, since arc-cosine is an invertible function, to the extent in which the approximation
above is valid (which depends on $r$), we can recover curvature from
the integral invariant.
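
Under the first-order model of eq. (7), recovering curvature amounts to inverting the arc-cosine: $\kappa \simeq \tfrac{2}{r} \cos\left( I_\gamma^r / r^2 \right)$. The sketch below writes out both directions; the function names are hypothetical and the model is only meaningful for $|r \kappa| \le 2$.

```python
import math

def area_from_curvature(kappa, r):
    """First-order model of eq. (7): I ~ r^2 * arccos(r * kappa / 2)."""
    return r * r * math.acos(0.5 * r * kappa)

def curvature_from_area(area, r):
    """Invert the model above: kappa ~ (2 / r) * cos(I / r^2)."""
    return 2.0 / r * math.cos(area / (r * r))

r = 0.3
recovered = curvature_from_area(area_from_curvature(0.8, r), r)
straight = area_from_curvature(0.0, r)
```

For zero curvature the model returns $\pi r^2 / 2$, the area of a half disk, which matches the intuition that a straight boundary cuts the kernel in half.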

The approximation above is valid in the limit when $r \to 0$; as $r$ increases,
$B_r(p)$ encloses the entire curve $\gamma$ (which is closed), and consequently $I_\gamma^r$ becomes
a constant beyond a certain radius $r = r_{max}$. Therefore, for values of $r$ that range
from 0 to $r_{max}$ we obtain an intrinsic scale-space of invariants, in contrast to the
extrinsic scale-space of curvature. We compare these two descriptors in Fig. 3
and Fig. 4.

Note also that the integral invariant can be normalized via $I_\gamma^r / (\pi r^2)$ so as to
provide a scale-invariant description of the curve, which is therefore invariant
with respect to the similarity group. The corresponding integral invariant is then
bounded between 0 and 1.



5 Invariant Signature Curves 

The invariant $I_\gamma^r(p(s))$ can be represented by a function of $s$ for any fixed value
of r. This means, however, that in order to register two shapes, an “initial point” 
s = 0 must be chosen. There is nothing intrinsic to the geometry of the curve 
in the choice of this initial point, and indeed it would be desirable to devise a 
description that, in addition to being invariant to the group, is invariant with 
respect to the choice of initial point. 

¹ Notice that our invariant does not require that the shape be smooth, and this assumption
is made only to relate our results to the literature on differential invariants.
[Figure 5 column titles: Shape; Local area invariant (r = 2); Local area invariant (r = 5); Curvature invariant.]


Fig. 5. Example of signature curves for a set of representative shapes (left column); 
local area invariant with small kernel (middle left column) and large kernel (middle 
right column), differential invariant (right column). 



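In order to do so, we follow the classic literature on differential invariants (see
[10] and references therein) and plot a signature, that is the graph of the derivative
$\dot{I}_\gamma^r$ of the invariant along the curve versus $I_\gamma^r$. We indicate such a signature concisely by

$$(I_\gamma^r, \dot{I}_\gamma^r), \qquad (8)$$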

which of course can be plotted for all values of $r \in [0, r_{max}]$, yielding a
scale-space of signatures. Naturally, we want to avoid direct computation of
the derivative of the invariant, so the signature can be computed more simply
as follows: Consider the binary image $\chi(\bar{\gamma})$ and convolve it with the kernel
$h(p, x) = B_r(p - x)$, where $p \in \mathbb{R}^2$, not just the curve $\gamma$. Evaluating the result of
this convolution on $p \in \gamma$ yields $I_\gamma^r$, without the need to parameterize the curve.
For $\dot{I}_\gamma^r$, compute the gradient of the filter response and take its inner product
with the tangent vector field of the image $\chi(\bar{\gamma})$, formed by filtering again by a
kernel different from $B_r(p - x)$ and rotating its normalized gradient by 90°. The
result, when evaluated at $p \in \gamma$, yields $\dot{I}_\gamma^r$.

Notice that from the integral invariant signature we can reconstruct all dif- 
ferential invariants in the limit when r — >■ 0. In fact, from If we can compute k, 
and therefore from the signature we can compute k. 
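The convolution recipe for the local area invariant can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the shape is given as a binary mask of the region enclosed by the curve, the disk kernel is sampled on the pixel grid, and the function name `local_area_invariant` and the normalization by the disk area are choices of this sketch, not the authors' implementation. The derivative component of the signature is not computed here.

```python
import numpy as np

def local_area_invariant(mask, r):
    """Fraction of a radius-r disk centred at each pixel that lies inside the shape.

    mask : 2D boolean/0-1 array, 1 inside the region enclosed by the curve (assumed input).
    r    : integer disk radius in pixels.
    """
    h, w = mask.shape
    # Disk kernel B_r sampled on the pixel grid.
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    disk = (xx ** 2 + yy ** 2 <= r ** 2).astype(float)
    # Zero-pad so that the FFT-based circular convolution equals a linear one.
    H, W = h + 2 * r, w + 2 * r
    spectrum = np.fft.rfft2(mask.astype(float), s=(H, W)) * np.fft.rfft2(disk, s=(H, W))
    full = np.fft.irfft2(spectrum, s=(H, W))
    # Crop the centred, same-size response and normalize by the disk area.
    return full[r:r + h, r:r + w] / disk.sum()
```

Sampling the returned array at boundary pixels of the mask gives discrete samples of the invariant along the curve, without any parameterization. For a smooth convex shape and a small kernel one would expect those samples to lie slightly below 0.5.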



6 Distance between Shapes 

In this section we outline methods for computing the distance between two
shapes based on their invariants and invariant signature curves.

Fig. 6. Noisy shape recognition from a database of 23 shapes. The upper number in
each cell is the distance computed via the local-area integral invariant; the lower number
is the distance computed via the curvature invariant. The number in italics represents the
best match for a noisy shape. See the text for more details.

A straightforward distance between two shapes $\gamma_1$ and $\gamma_2$ is a
measure of the error between their invariants. One choice is the squared error,

$D_E(\gamma_1, \gamma_2, r) = \int \big( I_r^{\gamma_1}(p(s)) - I_r^{\gamma_2}(p(s)) \big)^2 \, ds.$ (9)

While this squared error can be computed for any invariant functional, we focus 
on invariants that preserve locality, such as the local area invariant, so that these 
distances will be valid for application to shape recognition despite occlusion.

Fig. 7. Summary of noisy shape recognition from a database of 23 shapes.



However, as discussed in Sec. 5, this computation is sensitive to the parameterization
of the shapes, specifically the assignment of the initial point. To avoid
this dependence, the distance in eq. (9) must be optimized with respect to the
choice of $s = 0$. We demonstrate the application of the distance computed in this
way in Sec. 7, where we also define a distance based on curvature in the
same way.
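A discrete version of this optimization over the initial point can be sketched as a search over cyclic shifts. The sketch assumes that both invariants are sampled at the same number of equally spaced positions along the curve and traversed in the same direction; the function name and the use of a mean in place of the integral are illustrative choices, not taken from the paper.

```python
import numpy as np

def shift_optimized_sq_error(inv_a, inv_b):
    """Squared-error distance minimized over the choice of starting sample.

    inv_a, inv_b : 1D arrays of invariant values sampled along two closed curves
                   (same length, same traversal direction; both are assumptions).
    Returns (distance, best_shift), where best_shift is the cyclic shift applied to inv_b.
    """
    a = np.asarray(inv_a, dtype=float)
    b = np.asarray(inv_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("expected two 1D arrays of equal length")
    # Mean squared error for every possible initial point on the second curve.
    errors = [np.mean((a - np.roll(b, k)) ** 2) for k in range(len(b))]
    best = int(np.argmin(errors))
    return float(errors[best]), best
```

The exhaustive search costs O(n^2) for n samples, which is usually acceptable for contours with a few hundred samples.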

As an alternative to optimizing $D_E$, we can define a distance on a parameter-independent
representation, such as the signature. The symmetric Hausdorff
distance between signature curves (represented as point sets),

$D_H(\gamma_1, \gamma_2, r) = H\big( (I_r^{\gamma_1}, \dot{I}_r^{\gamma_1}),\ (I_r^{\gamma_2}, \dot{I}_r^{\gamma_2}) \big)$ (10)

is one such distance. The Hausdorff distance does not rely on correspondence between
points, which is advantageous because it provides the parameter-independent
distance we desire, but problematic when non-corresponding segments of the
signatures are perturbed so that they overlap.
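For signatures stored as finite point sets, the symmetric Hausdorff distance can be computed directly from the matrix of pairwise distances. The following is a minimal brute-force sketch; the function name and the array layout (one signature point per row) are assumptions made for illustration.

```python
import numpy as np

def symmetric_hausdorff(P, Q):
    """Symmetric Hausdorff distance between two point sets of shape (n, d) and (m, d)."""
    P = np.atleast_2d(np.asarray(P, dtype=float))
    Q = np.atleast_2d(np.asarray(Q, dtype=float))
    # Pairwise Euclidean distances, d[i, j] = |P[i] - Q[j]|.
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    # Directed distances: farthest point of one set from its nearest neighbour in the other.
    p_to_q = d.min(axis=1).max()
    q_to_p = d.min(axis=0).max()
    return float(max(p_to_q, q_to_p))
```

No correspondence between points is used, so the result does not depend on where each signature starts.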

However, other measures that characterize the signature, such as winding 
number, can be integrated into the distance measure to better discriminate
these signatures. Additionally, a richer multiscale description of the curve can be 
created by computing the above distances for a set of kernel sizes. The integration 
of multiscale information, along with other measures such as winding number, 
is the subject of ongoing investigation. 



7 Experiments 

In this section we apply the invariant shape descriptions to the problem of 
Euclidean-invariant matching of shapes in noise. In Fig. 6, we demonstrate shape 
matching in a collection of 23 shapes, and summarize the results in Fig. 7. The 
collection contains several groups of shapes; shapes within a group are similar 
(e.g. different breeds of fish), but the groups are quite different (intuitively, hands
are not like fish). 

The figure shows the distance between the shapes (shown on the left side)
and noisy versions of the shapes (shown across the top). Within each block are
two distances: on top, the integral invariant distance $D_E$ defined in the previous
section, and on the bottom the differential invariant distance defined similarly.
In each column, the lowest distance for the shape shown at the top of the
column is shown in italics. The distance based on the integral invariant finds 
the correct match (i.e. the distance between a noisy shape and the correct pair 
is lowest) in all but one case. The exception is the noisy, rotated hand (fourth 
column from the right), which has equal distance to itself and its unrotated 
neighbor, demonstrating the invariance to rotation of this model. Moreover, dis- 
tances between similar shapes are lower than distances between members of 
different groups. 

Matching results based on the differential invariant are not as consistent as 
those based on the integral invariant. There are eight mismatches among the 
23 noisy images; most frequently, when a shape cannot be matched it is paired 
with the triangle (fifth from the right). This may be because the curvature of 
the triangle is zero almost everywhere, and best approximates the mean of many 
of the noisy curvature functions. More generally, and more problematically, for 
some groups distances between similar shapes are higher than distances between 
shapes belonging to other groups, violating the required properties of a distance. 
For instance, the average inter-group distance is 452.8, while the average intra- 
group distance is 316.6! Compare this to an inter-group distance of 11.0, which 
is lower than the intra-group distance of 17.4 for the integral invariant distance.

8 Conclusion 

In this paper we have introduced a general class of integral Euclidean- and 
similarity-invariant functionals of shape data. We argue that these functionals 
are less sensitive to noise than differential ones, but can be exploited in similar 
ways, for instance, to define invariant signature curves that can be used as a 
representation to define various notions of shape distance. In addition, the inte- 
gration kernel includes an intrinsic scale-space parameter. We presented efficient 
numerical implementations of these invariants, and, in the limit, established a 
completeness property for the representation by showing a one-to-one correspon- 
dence with curvature. We demonstrated our results with several experiments, 
including an application to shape matching using synthetic and real data. 



References 

1. R. Alferez and Y. F. Wang. Geometric and illumination invariants for object
recognition. PAMI, 21(6):505-536, 1999. 

2. K. Arbter, W. E. Snyder, H. Burkhardt, and G. Hirzinger. Applications of
affine-invariant Fourier descriptors to recognition of 3-d objects. PAMI, 12(7):640-646,
1990.

3. S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using 
shape contexts. PAMI, 24(4):509-522, 2002. 

4. A. Bengtsson and J.-O. Eklundh. Shape representation by multiscale contour
approximation. PAMI, 13(1):85-93, 1991.

5. M. Boutin. Numerically invariant signature curves. IJCV, 40(3):235-248, 2000. 

6. R. D. Brandt and F. Lin. Representations that uniquely characterize images mod- 
ulo translation, rotation and scaling. PRL, 17:1001-1015, 1996. 







7. A. Bruckstein, N. Katzir, M. Lindenbaum, and M. Porat. Similarity invariant 
signatures for partially occluded planar shapes. IJCV, 7(3):271-285, 1992. 

8. A. M. Bruckstein, R. J. Holt, A. N. Netravali, and T. J. Richardson. Invariant
signatures for planar shape recognition under partial occlusion. CVGIP:IU,
58(1):49-65, 1993.

9. A. M. Bruckstein, E. Rivlin, and I. Weiss. Scale-space semi-local invariants. IVC, 
15(5):335-344, 1997. 

10. E. Calabi, P. Olver, C. Shakiban, A. Tannenbaum, and S. Haker. Differential 
and numerically invariant signature curves applied to object recognition. IJCV, 
26:107-135, 1998. 

11. D. Chetverikov and Y. Khenokh. Matching for shape defect detection. LNCS,
1689(2):367-374, 1999.

12. T. Cohignac, C. Lopez, and J. M. Morel. Integral and local affine invariant
parameter and application to shape recognition. ICPR, 1:164-168, 1994.

13. J. B. Cole, H. Murase, and S. Naito. A lie group theoretical approach to the 
invariance problem in feature extraction and object recognition. PRL, 12:519-523, 
1991. 

14. L. E. Dickson. Algebraic Invariants. John Wiley & Sons, 1914.

15. J. Dieudonne and J. Carrell. Invariant Theory: Old and New. Academic Press, 
London, 1970. 

16. J. Flusser and T. Suk. Pattern recognition by affine moment invariants. Pat. Rec., 
26(1):167-174, 1993. 

17. D. A. Forsyth, J. L. Mundy, A. P. Zisserman, C. Coelho, A. Heller, and C. A. Rothwell.
Invariant descriptors for 3-d object recognition and pose. PAMI, 13(10):971-991,
1991.

18. D.A. Forsyth, J.L. Mundy, A. Zisserman, and C.M. Brown. Projectively invariant 
representations using implicit algebraic curves. IVC, 9(2):130-136, 1991. 

19. L. Van Gool, T. Moons, E. Pauwels, and A. Oosterlinck. Semi-differential invariants.
In J. Mundy and A. Zisserman, editors, Geometric Invariance in Computer
Vision, pages 193-214. MIT, Cambridge, 1992.

20. L. Van Gool, T. Moons, and D. Ungureanu. Affine/photometric invariants for 
planar intensity patterns. ECCV, 1:642-651, 1996. 

21. J. H. Grace and A. Young. The Algebra of Invariants. Cambridge, 1903. 

22. C. E. Hann and M. S. Hickman. Projective curvature and integral invariants. 
IJCV, 40(3):235-248, 2000. 

23. M. K. Hu. Visual pattern recognition by moment invariants. IRE Trans. on IT,
8:179-187, 1961. 

24. K. Kanatani. Group Theoretical Methods in Image Understanding. Springer, 1990. 

25. E. P. Lane. Projective Differential Geometry of Curves and Surfaces. University 
of Chicago Press, 1932. 

26. J. Lasenby, E. Bayro-Corrochano, A. N. Lasenby, and G. Sommer. A new frame- 
work for the formation of invariants and multiple-view constraints in computer 
vision. ICIP, 1996. 

27. G. Lei. Recognition of planar objects in 3-d space from single perspective views 
using cross ratio. Robot. and Automat., 6(4):432-437, 1990.

28. R. Lenz. Group Theoretical Methods in Image Processing, volume 413 of LNCS. 
Springer, 1990. 

29. S. Z. Li. Shape matching based on invariants. In O. M. Omidvar, editor,
Progress in Neural Networks: Shape Recognition, volume 6, pages 203-228. Intellect,
1999.







30. S. Liao and M. Pawlak. On image analysis by moments. PAMI, 18(3):254-266, 
1996. 

31. T. Miyatake, T. Matsuyama, and M. Nagao. Affine transform invariant curve
recognition using Fourier descriptors. Inform. Processing Soc. Japan, 24(1):64-71,
1983. 

32. F. Mokhtarian and A. K. Mackworth. A theory of multi-scale, curvature-based 
shape representation for planar curves. PAMI, 14(8):789-805, 1992. 

33. D. Mumford, J. Fogarty, and F. C. Kirwan. Geometric invariant theory. Springer- 
Verlag, Berlin ; New York, 3rd edition, 1994. 

34. D. Mumford, A. Latto, and J. Shah. The representation of shape. IEEE Workshop 
on Comp. Vis., pages 183-191, 1984. 

35. J. L. Mundy and A. Zisserman, editors. Geometric Invariance in Computer Vision. 
MIT, 1992. 

36. L. Nielsen and G. Sparr. Projective area-invariants as an extension of the
cross-ratio. CVGIP, 54(1):145-159, 1991.

37. P. J. Olver. Equivalence, Invariants and Symmetry. Cambridge, 1995. 

38. T. Pajdla and L. Van Gool. Matching of 3-d curves using semi-differential invari- 
ants. ICCV, pages 390-395, 1995. 

39. T. H. Reiss. Recognizing planar objects using invariant image features. In LNCS, 
volume 676. Springer, 1993. 

40. C. Rothwell, A. Zisserman, D. Forsyth, and J. Mundy. Canonical frames for planar 
object recognition. ECCV, pages 757-772, 1992. 

41. C. Rothwell, A. Zisserman, D. Forsyth, and J. Mundy. Planar object recognition 
using projective shape representation. IJCV, 16:57-99, 1995. 

42. G. Sapiro and A. Tannenbaum. Affine invariant scale space. IJCV, 11(1):25-44,
1993. 

43. G. Sapiro and A. Tannenbaum. Area and length preserving geometric invariant 
scale-spaces. PAMI, 17(1):67-72, 1995.

44. J. Sato and R. Cipolla. Affine integral invariants for extracting symmetry axes. 
IVC, 15(8):627-635, 1997. 

45. A. Shashua and N. Navab. Relative affine structure: Canonical model for 3d from 
2d geometry and applications. PAMI, 18(9):873-883, 1996. 

46. C. E. Springer. Geometry and Analysis of Projective Spaces. Freeman, San Francisco,
1964.

47. Q. M. Tieng and W. W. Boles. Recognition of 2d object contours using the wavelet 
transform zero-crossing representation. PAMI, 19(8):910-916, 1997. 

48. J. Verestoy and D. Chetverikov. Shape defect detection in ferrite cores. Machine
Graphics and Vision, 6(2):225-236, 1997. 

49. I. Weiss. Noise resistant invariants of curves. PAMI, 15(9):943-948, 1993. 

50. A. P. Witkin. Scale-space filtering. Int. Joint. Conf. AI, pages 1019-1021, 1983.

51. G. T. Zahn and R. Z. Roskies. Fourier descriptors for plane closed curves. Trans. 
Comp., 21:269-281, 1972. 

52. A. Zisserman, D.A. Forsyth, J. L. Mundy, C. A. Rothwell, and J. S. Liu. 3D object 
recognition using invariance. Art. Int., 78:239-288, 1995. 




Detecting Keypoints with Stable Position, Orientation, 
and Scale under Illumination Changes* 



Bill Triggs 

GRAVIR-CNRS-INRIA, 655 avenue de l’Europe, 38330 Montbonnot, France
Bill.Triggs@inrialpes.fr
http://www.inrialpes.fr/lear/people/triggs



Abstract. Local feature approaches to vision geometry and object recognition are 
based on selecting and matching sparse sets of visually salient image points, known 
as ‘keypoints’ or ‘points of interest’. Their performance depends critically on the 
accuracy and reliability with which corresponding keypoints can be found in sub- 
sequent images. Among the many existing keypoint selection criteria, the popular 
Förstner-Harris approach explicitly targets geometric stability, defining keypoints
to be points that have locally maximal self-matching precision under translational
least squares template matching. However, many applications require stability in
orientation and scale as well as in position. Detecting translational keypoints and
verifying orientation/scale behaviour post hoc is suboptimal, and can be misleading
when different motion variables interact. We give a more principled formulation,
based on extending the Förstner-Harris approach to general motion models and
robust template matching. We also incorporate a simple local appearance model to 
ensure good resistance to the most common illumination variations. We illustrate 
the resulting methods and quantify their performance on test images. 

Keywords: keypoint, point of interest, corner detection, feature based vision, 
Förstner-Harris detector, template matching, vision geometry, object recognition.



Local-feature-based approaches have proven successful in many vision problems, in- 
cluding scene reconstruction [16,5], image indexing and object recognition [20,21,32, 
33,23,24,25]. The basic idea is that focusing attention on comparatively sparse sets of 
especially salient image points — usually called keypoints or points of interest — both 
saves computation (as most of the image is discarded) and improves robustness (as there 
are many simple, redundant local cues rather than a few powerful but complex and deli- 
cate global ones) [37]. However, local methods must be able to find ‘the same’ keypoints 
again in other images, and their performance depends critically on the reliability and 
accuracy with which exactly corresponding points can be found. Many approaches to 
keypoint detection exist, including ‘corners’ [2,17,38,28,4], parametric image models
[3,31,1], local energy / phase congruency [27,29,30,18], and morphology [35,19]. One
of the most popular is that developed by Förstner & Gülch [7,9] and Harris & Stephens
[15] following earlier work by Hannah [14] and Moravec [26]. This approach brings
the accuracy issue to the fore by defining keypoints to be points at which the predicted
precision of local least squares image matching is locally maximal [14,22,6,10,12,11].

* This research was supported by the European Union FET-Open research project VIBES.

Notionally, this is implemented by matching the local image patch against itself under
small translations, using one of a range of criteria to decide when the ‘sharpness’ of the
resulting correlation peak is locally optimal. Moravec did this by explicit single-pixel
translations [26]; Hannah by autocorrelation [14]; and Förstner by implicit least squares
matching, using Taylor expansion to re-express the accuracy in terms of the eigenvalues
of the scatter matrix or normal matrix of the local image gradients, $\int \nabla I^\top \nabla I \, dx$
[7,9,8]. All of these methods use rectangular patches, usually with a scale significantly
larger than that of the image gradients used. This is problematic for patches that con- 
tain just one strong feature, because the self-matching accuracy for these is the same 
wherever the feature is in the patch, i.e. the matching-based approach guarantees good 
self-matching accuracy, but not necessarily accurate centring of the patch on a visible 
feature. Working independently of Förstner, Harris & Stephens improved the localization
performance by replacing the rectangular patches with Gaussian windows (convolutions) 
with a scale similar to that of the derivatives used [15]. With Gaussian-based derivative 
calculations and more careful attention to aliasing, the method has proven to be one of 
the most reliable keypoint detectors, especially in cases where there are substantial image 
rotations, scalings or perspective deformations [33,24]. 

One problem with the Förstner-Harris approach is that it optimizes keypoints only for
good translational precision, whereas many applications need keypoints that are stable
not only under translations, but also under rotations, changes of scale, perspective
deformations, and changes of illumination (c.f. [34]). In particular, many local feature based
object recognition / matching methods calculate a vector of local image descriptors at
each keypoint, and later try to find keypoints with corresponding descriptors in other
images [20,21,32,23,24,25]. This usually requires the extraction of a dominant orientation
and scale at each keypoint, and keypoints that have poorly defined orientations or scales 
tend to produce descriptors that vary too much over re-detections to be useful. Hence, it 
seems useful to develop keypoint detectors that explicitly guarantee good orientation and 
scale stability, and also good stability under local illumination variations. This is the goal 
of the current paper, which generalizes the Förstner-Harris self-matching argument to
include non-translational motions, and also provides improved resistance to illumination 
variations by replacing simple least squares matching with an illumination-compensated 
matching method related to Hager & Belhumeur’s [13]. 

Much of the paper focuses on the low-level task of characterizing the local stability 
of matching under geometric transformations and illumination variations. The Förstner-
Harris approach shows that such analysis is a fruitful route to practical keypoint detection
in the translational case, and we argue that this continues to hold for more general 
transformations. Also note the relationship to invariance: if we use image descriptors 
based at the keypoints for matching, the more invariant the descriptors are to a given type 
of transformation, the less accurate the keypoint detection needs to be with respect to 
these transformations. But exactly for this reason, it is useful to develop detectors whose 
performance under different types of transformations is quantifiable and controllable,
and our approach explicitly does this. We adopt the following basic philosophy: 



(i) There is no such thing as generic keypoints. They should be selected specifically for
the use to which they will be put, using a purpose-designed detector and parameters.

(ii) Keypoints are not just positions. Stability in orientation and scale and resistance to
common types of appearance variations are also needed.

(iii) Each image (template) matching method defines a corresponding self-matching based
keypoint detector. If the keypoints will be used as correspondence hypotheses that are
later verified by inter-image template matching, the keypoint detector and parameters
corresponding to the matching method should be used.

Contents: Section 1 describes our matching based framework for keypoint detection.
Section 2 gives some specific examples and implementation details. Section 3 gives a
few experimental results.

Notation: $x$ stands for image coordinates, $\nabla$ for $x$-derivatives, $I$, $R$ for the images being
matched (treated as functions of $x$), $t$ for the image motion/warping model, $c$ for the pixel
comparison functional. Derivatives are always row vectors, e.g. $\delta I \approx \nabla I \, \delta x$. For most
of the paper we assume continuous images and ignore sampling issues.

1 General Framework 

This section develops a general framework for robust image (template) matching under 
analytical image deformation and appearance variation models, uses it to derive stability 
estimates for locally optimal matches, and applies this to characterize keypoint stability 
under self-matching. 

Template Matching Model: We will use the following generalized error model for
template matching, explained element-by-element below:

$Q(\mu, \lambda) = \int c\big( I(t(x, \mu), \lambda),\ R(x),\ x \big) \, dx$ (1)

$I$ is the image patch being matched, $R$ is the reference patch it is being matched against,
$x$ is a set of 2D image coordinates centred on $R$, and $c \ge 0$ (discussed further below) is a
weighted image pixel comparison functional that is integrated over the patch to find the
overall matching quality metric $Q$. $x' = t(x, \mu)$ is an image motion / warping model that
maps $R$’s coordinates $x$ forwards into $I$’s natural coordinate system, i.e., $I$ is effectively
being pulled back (warped backwards) into $R$’s frame before being compared. The motion
model $t$ is controlled by a vector of motion parameters $\mu$ (2D translation, perhaps
rotation, scaling, affine deformation...). Before being compared, $I$ may also undergo an
optional appearance correction controlled by a vector of appearance parameters $\lambda$ (e.g.,
luminance or colour shifts/rescalings/normalizations, corrections for local illumination
gradients...). Note that we think of the input patch $I$ as an ad hoc function $I(x, \lambda)$ of
both the position and appearance parameters, rather than as a fixed image $I(x)$ to which
separate appearance corrections are applied. This allows the corrections to be image-content
dependent and nonlocal within the patch (e.g. subtracting the mean in Zero Mean
Cross Correlation). We assume that $\mu = 0$ represents a neutral position or reference
transformation for the patch (e.g. no motion, $t(x, 0) = x$). Similarly, $\lambda = 0$ represents a
default or reference appearance setting (e.g. the unchanged input, $I(x, 0) = I(x)$).







The patch comparison integral is over a spatial window centred on $R$, but for compactness
we encode this in the pixel comparison metric $c$. So $c$ usually has the form:

$c(I(x), R(x), x) = w(x) \cdot \rho(I(x), R(x))$ (2)



where $w(x)$ is a spatial windowing function (rectangular, Gaussian...) that defines the
extent of the relevant patch of $R$, and $\rho(I(x), R(x))$ is a spatially-invariant image pixel
comparison metric, e.g., the squared pixel difference $\|I(x) - R(x)\|^2$ for traditional
unweighted least squares matching. The “pixels” here may be greyscale, colour, multi-band,
or even pre-extracted edge, feature or texture maps, so $\rho()$ can be quite complicated in
general, e.g. involving nonlinear changes of luminance or colour space, perceptual or
sensitivity-based comparison metrics, robust tailing-off at large pixel differences to reduce
the influence of outliers, etc. Ideally, $\rho()$ should return the negative log likelihood
for the pixels to correspond, so that (assuming independent noise in each pixel) $Q$ becomes
the total negative log likelihood for the patchwise match. For practical inter-image
template matching, the reliability depends critically on the robustness (large difference
behaviour) of $\rho()$. But for keypoint detection, we always start from the self-matching
case $I = R$, so only the local behaviour of $\rho()$ near $I = R$ is relevant: keypoint detectors
are oblivious to large-difference robustification of $\rho()$. We will assume that $\rho()$ has
least-squares-like behaviour for small pixel differences, i.e. that it is locally differentiable with
zero gradient and positive semi-definite Hessian at $I = R$, so that:



$\left. \frac{\delta c}{\delta I(x)} \right|_{I=R} = 0, \qquad \left. \frac{\delta^2 c}{\delta I(x)^2} \right|_{I=R} \ge 0$ (3)



Our derivations will be based on 2nd order Taylor expansion at $I = R$, so they exclude
both non-differentiable $L_1$ matching metrics like Sum of Absolute Differences (SAD)
and discontinuous $L_0$ (on-off) style ones. Our overall approach probably extends to such
metrics, at least when used within a suitable interpolation model, but their abrupt changes 
and weak resampling behaviour make general derivations difficult. 
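To make the error model (1) and the windowed comparison (2) concrete, the following sketch evaluates the cost for the simplest possible setting. It assumes a greyscale image, integer-pixel translations as the only motion parameters, no appearance correction, a Gaussian window, and the squared difference as the pixel metric; all function and parameter names are hypothetical and chosen for this illustration only.

```python
import numpy as np

def gaussian_window(size, sigma):
    """Separable Gaussian weighting w(x) on a size-by-size patch, normalized to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-0.5 * (ax / sigma) ** 2)
    w = np.outer(g, g)
    return w / w.sum()

def match_cost(image, ref_patch, top_left, shift, window):
    """Windowed sum of squared differences for an integer translation (dy, dx).

    The slice plays the role of warping the input image back into the frame of the
    reference patch; the caller must keep the shifted patch inside the image.
    """
    n = ref_patch.shape[0]
    y = top_left[0] + shift[0]
    x = top_left[1] + shift[1]
    warped = image[y:y + n, x:x + n]
    return float(np.sum(window * (warped - ref_patch) ** 2))
```

In the self-matching case the cost is exactly zero at zero shift and grows as the patch is displaced; how quickly it grows is what the stability analysis in the text quantifies.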

Finally, we allow $c$ to be a functional, not just a function, of $I$, $R$ (i.e. a function of the
local patches, not just their pointwise pixel values). In particular, $c$ may run $I$, $R$ through
convolutional filters (‘prefilters’) before comparing them, e.g. to restrict attention to a
given frequency band in scale-space matching, or simply to suppress high frequencies
for reduced aliasing and/or low frequencies for better resistance to global illumination
changes. In general, the resampling implied by $t()$ could significantly change $I$’s spatial
frequency content, so prefiltering only makes sense if we do it after warping. We will thus
assume that prefilters run in $x$-space, i.e. they are defined relative to the coordinates of
the reference image $R$. For example, for affine-invariant keypoint detection [32,24,25],
keypoint comparison should typically be done, and in particular prefiltering should be
applied, in the characteristic affine-normalized frame of the reference keypoint, so $x$
would typically be taken to be the affine-normalized coordinates for $R$. For any $t()$,
derivatives of the unwarped input image $I$ can always be converted to derivatives of its
prefilter using integration by parts, so the effective scale of derivative masks always ends
up being the $x$-space scale of the prefilter.



Matching Precision: Now suppose that we have already found a locally optimal template
match. Consider the behaviour of the matching quality metric $Q$ under small perturbations
$I \to I + \delta I$. Under 2nd order Taylor expansion:

$\delta Q \approx \int \left( \frac{\delta c}{\delta I}\, \delta I + \frac{1}{2}\, \delta I^\top \frac{\delta^2 c}{\delta I^2}\, \delta I \right) dx$ (4)

For any perturbation of an exact match, $I(t(x)) = R(x)$, the first order ($\delta I$) term vanishes
identically by (3). More generally, if we are already at a local optimum of $Q$ under some
class of perturbations $\delta I$, the integrated first order term vanishes for this class. Both hold
for keypoints, so we will ignore the first order $\delta I$ term from now on.

Using the parametric model $I(t(x, \mu), \lambda)$, the image $I$ changes as follows under first
order changes of the motion and appearance parameters $\mu$, $\lambda$:

$\delta I \approx L\, \delta\lambda + M\, \delta\mu$, where $L \equiv \frac{\partial I}{\partial \lambda}$, $M \equiv \frac{\partial I}{\partial \mu} = \nabla I \, T$, $T \equiv \frac{\partial t}{\partial \mu}$ (5)

Here, $\nabla I \equiv \frac{\partial I}{\partial x}(t(x))$ is the standard gradient of the original unwarped image $I$, evaluated
in $I$’s own frame at $t(x)$. The columns of the Jacobians $L$ and $M$ can be thought of as
appearance and motion basis images, characterizing the linearized first-order changes in
$I$ as the parameters are varied. Putting (4, 5) together gives a quadratic local cost model
for perturbations of the match around the optimum, based on a positive semidefinite
generalized scatter matrix $S$:



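$\delta Q(\delta\lambda, \delta\mu) \approx \frac{1}{2} \begin{pmatrix} \delta\lambda^\top & \delta\mu^\top \end{pmatrix} S \begin{pmatrix} \delta\lambda \\ \delta\mu \end{pmatrix}$ (6)

$S \equiv \begin{pmatrix} A & B \\ B^\top & C \end{pmatrix} \equiv \int \begin{pmatrix} L^\top \\ M^\top \end{pmatrix} \frac{\delta^2 c}{\delta I^2} \begin{pmatrix} L & M \end{pmatrix} dx$ (7)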



$S$ generalizes the matrix $\int \nabla I^\top \nabla I \, dx$ that appears in the Förstner-Harris keypoint
detector (which assumes pure translation, $T = I$ (the identity), $M = \nabla I$, quadratic pixel difference
metric $\frac{\delta^2 \rho}{\delta I^2} = I$, and empty illumination model $L$). To the extent that $c$ gives the negative
log likelihood for the match, $S$ is the maximum likelihood saddle point approximation to
the Fisher information matrix for estimating $\lambda$, $\mu$ from the match. I.e., $S^{-1}$ approximates
the covariance with which the parameters $\lambda$, $\mu$ can be estimated from the given image
data: the larger $S$, the stabler the match, in the sense that the matching error $\delta Q$ increases
more rapidly under given perturbations $\delta\lambda$, $\delta\mu$.
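The translational special case is easy to reproduce numerically. The sketch below builds the 2x2 windowed gradient scatter matrix for a single patch, assuming a greyscale patch, finite-difference gradients from `np.gradient`, and a Gaussian window; the function name and these discretization choices are illustrative assumptions, not the detector described in the paper.

```python
import numpy as np

def translational_scatter(patch, sigma):
    """2x2 windowed scatter matrix of image gradients for a pure-translation model."""
    patch = np.asarray(patch, dtype=float)
    # Finite-difference gradients: gy along rows (axis 0), gx along columns (axis 1).
    gy, gx = np.gradient(patch)
    n0, n1 = patch.shape
    ay = np.arange(n0) - (n0 - 1) / 2.0
    ax = np.arange(n1) - (n1 - 1) / 2.0
    w = np.outer(np.exp(-0.5 * (ay / sigma) ** 2), np.exp(-0.5 * (ax / sigma) ** 2))
    # One row per pixel: the motion basis images for x- and y-translation.
    M = np.stack([gx.ravel(), gy.ravel()], axis=1)
    return M.T @ (w.ravel()[:, None] * M)
```

For a patch containing a single straight edge the smaller eigenvalue is (numerically) zero, reflecting the fact that translation along the edge cannot be estimated, whereas a corner-like patch gives two clearly positive eigenvalues.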

* Strictly, to be correct to $O((\delta\mu, \delta\lambda)^2)$ we should also expand (5) to 2nd order, which introduces
a 2nd order ‘tensor’ correction in the $\delta I$ term of (4). But, as above by (3), the latter term vanishes
identically for keypoint detection. Even for more general matching, the correction is usually
negligible unless the match is poor and the motion / appearance models are very nonlinear. One
can think of (7) as a Gauss-Newton approximation to the true $S$. It guarantees that $S$ is at least
positive semidefinite (as it must be at a locally optimal match). We will adopt it from now on.

Now suppose that we want to ensure that the two patches match stably irrespective
of appearance changes. For a given perturbation $\delta\mu$, the appearance change that
gives the best match to the original patch (and hence that masks the effect of the
motion as well as possible, thus creating the greatest matching uncertainty) can be
found by minimizing $\delta Q(\delta\lambda, \delta\mu)$ w.r.t. $\delta\lambda$. By inspection from (6), this is
$\delta\lambda(\delta\mu) = -A^{-1} B \, \delta\mu$. Back-substituting into (6) gives an effective quadratic reduced
penalty function $\delta Q_{red}(\delta\mu) \equiv \delta Q(\delta\lambda(\delta\mu), \delta\mu) \approx \frac{1}{2}\, \delta\mu^\top C_{red}\, \delta\mu$ characterizing
motion-with-best-appearance-adaptation, where the reduced scatter matrix is

$C_{red} \equiv C - B^\top A^{-1} B$ (8)

with $A$, $B$, $C$ as in (7). $C_{red}$ and $C$ quantify the precision of motion estimation respectively
with and without appearance adaptation. Some precision is always lost by factoring out
appearance, so $C_{red}$ is always smaller than $C$ (in the sense that $C - C_{red}$ is positive
semidefinite). To the extent that the matching error metric $c$ is a statistically valid log
likelihood model for image noise, $C^{-1}$ and $C_{red}^{-1}$ estimate the covariances of the
corresponding motion parameter estimates under trials with independent
noise samples. More generally, if we also have prior information that appearance
variations are not arbitrary, but have zero mean and covariance $D^{-1}$, the optimal $\delta\lambda(\delta\mu)$
becomes $-(A + D)^{-1} B \, \delta\mu$ and $C_{red}$ is replaced by the less strongly reduced covariance
$C_D \equiv C - B^\top (A + D)^{-1} B$.
Keypoint Detection: Ideally, we want to find keypoints that can be stably and reliably
re-detected under arbitrary motions from the given transformation family $t(x, \mu)$, despite
arbitrary changes of appearance from the appearance family $I(x, \lambda)$. We focus on
the ‘stability’ aspect^, which we characterize in terms of the precision of self-matching
under our robust template matching model. The idea is that the patch itself is its own
best template: if it cannot be matched stably even against itself, it is unlikely to be
stably matchable against other patches. We are interested in stability despite appearance
changes, so we use the reduced scatter matrix $C_{red}$ (8) to quantify geometric precision.

The amount of precision that is needed depends on the task, and we adopt the design 
philosophy that visual routines should be explicitly parametrized in terms of objective 
performance criteria such as output accuracy. To achieve this we require keypoints to 
meet a lower bound on matching precision (equivalently, an upper bound on matching
uncertainty). We quantify this by introducing a user-specified criterion matrix $C_0$
and requiring keypoints to have reduced precisions $C_{red}$ greater than $C_0$ (i.e. $C_{red} - C_0$
must be positive semidefinite). Intuitively, this means that for a keypoint candidate to
be accepted, its transformation-space motion-estimation uncertainty ellipse $C_{red}^{-1}$ must be
strictly contained within the criterion ellipse $C_0^{-1}$.

In textured images there may be whole regions where this precision criterion is
met, so for isolated keypoint detection we must also specify a means of selecting ‘the
best’ keypoint(s) within these regions. This requires some kind of ‘saliency’ or ‘interest’
metric, ideally an index of perceptual distinctiveness / reliable matchability modulo our
appearance model. But here, following the Förstner-Harris philosophy, we simply use
an index of overall matching precision as a crude substitute for this. In the
translation-only case, Förstner [7,9] and Harris & Stephens [15] discuss several suitable precision
indices, based on the determinant, trace and eigenvalues of the scatter matrix. In our case,
there may be several (more than 2) motion parameters, and eigenvalue based criteria
seem more appropriate than determinant based ones, owing to their clear links with
uncertainty analysis. Different motion parameters also have different units (translations
in pixels, rotations in radians, dilations in log units), and we need to normalize for this.
The criterion matrix $C_0$ provides a natural scaling, so as our final saliency criterion we
will take the minimum eigenvalue of the normalized reduced motion precision matrix
$C_0^{-1/2}\, C_{\mathrm{red}}\, C_0^{-1/2}$. Intuitively, this requires the longest axis of the motion-estimation
covariance ellipse, as measured in a frame in which $C_0$ becomes spherical, to be as
small as possible. With this normalization, the keypoint-acceptability criterion $C_{\mathrm{red}} \ge C_0$
simplifies to the requirement that the saliency (the minimum eigenvalue) must be greater
than one. Typically, $C_0$ is diagonal, in which case the normalization matrix $C_0^{-1/2}$ is the
diagonal matrix of maximum user-permissible standard errors in translation, rotation and
scale.

^ We do not consider other matchability properties [7] such as distinctiveness here, as this is more
a matter for the descriptors calculated once the keypoint is found. Distinctiveness is usually
characterized by probability of mismatch within a population of extracted keypoints (e.g. [33]).
For a recent entropic approach to image-wide distinctiveness, see [36].

As usual, pixel sampling effects introduce a small amount of aliasing or jitter in the
image derivative estimates, which has the effect of spreading gradient energy across the
various eigenvalues of S even when the underlying image signal varies only in one
dimension (e.g. a straight edge). As in the Förstner-Harris case, we compensate for this
heuristically by subtracting a small user-specified multiple $\alpha$ of the maximum eigenvalue
of $C_0^{-1/2}\, C_{\mathrm{red}}\, C_0^{-1/2}$ (the 1-D ‘straight edge’ signal) before testing for threshold and
saliency, so our final keypoint saliency measure is $\lambda_{\min} - \alpha\,\lambda_{\max}$.
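The normalized saliency test can be sketched as follows for the common case of a diagonal criterion matrix. This is only an illustration under that assumption; `keypoint_saliency`, `std_errors` and the example numbers are hypothetical names and values, with `std_errors` playing the role of the diagonal of $C_0^{-1/2}$.

```python
import numpy as np

def keypoint_saliency(C_red, std_errors, alpha=0.0):
    """Saliency lambda_min - alpha * lambda_max of C0^{-1/2} C_red C0^{-1/2}.

    std_errors holds the maximum permissible standard error for each motion
    parameter, i.e. the diagonal of C0^{-1/2} when C0 is diagonal.
    """
    s = np.asarray(std_errors, dtype=float)
    normalized = s[:, None] * C_red * s[None, :]
    lam = np.linalg.eigvalsh(normalized)      # eigenvalues in ascending order
    return lam[0] - alpha * lam[-1]

# Illustrative reduced precision for (translation, translation, rotation).
C_red = np.diag([4.0, 9.0, 100.0])
sal = keypoint_saliency(C_red, std_errors=[1.0, 1.0, 0.5])
sal_corrected = keypoint_saliency(C_red, std_errors=[1.0, 1.0, 0.5], alpha=0.06)
accepted = sal > 1.0                          # acceptability: saliency above one
```

Loosening or tightening an entry of `std_errors` rescales the corresponding parameter before the eigenvalue test, which is how the criterion matrix makes parameters with different units comparable.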

In practice, the Schur complement in $C_{\mathrm{red}} = C - B^\top A^{-1} B$ is calculated simply and
efficiently by outer-product based partial Cholesky decomposition. A standard symmetric
eigendecomposition method is then used to calculate the minimum eigenvalue, except
that 2D eigenproblems are handled as a special case for speed.



2 Examples of Keypoint Detectors 

Given the above framework, it is straightforward to derive keypoint detectors for specific 
pixel types and motion and appearance models. Here we only consider the simplest few 
motion and appearance models, and we assume greyscale images. 



Comparison Function: As in the traditional Harris detector, we will use simple squared
pixel difference to compare pixels, and a circular Gaussian spatial integration window.
So modulo prefiltering, the pixel comparison term in (7) reduces to simple weighting by the
window function.



Affine Deformations: For keypoints, only local deformations are relevant, so the most
general motion model that is useful is probably the affine one. We will use various subsets
of this, parametrizing affine motions linearly as $x' = x + T\mu$ where:

$$T\mu = \begin{pmatrix} 1 & 0 & -y & x & x & y \\ 0 & 1 & x & y & -y & x \end{pmatrix} \begin{pmatrix} u \\ v \\ r \\ s \\ a \\ b \end{pmatrix} = \begin{pmatrix} s+a & -r+b \\ r+b & s-a \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} u \\ v \end{pmatrix} \qquad (9)$$

Here, (x, y) are window-centred pixel coordinates, (u, v) is the translation, s the scale,
and for small motions, r is the rotation and a, b are axis- and 45°-aligned quadrupole
deformations. The resulting M matrix is as follows, where $\nabla I = (I_x, I_y)$:

$$M = \nabla I\, T = (I_x,\; I_y,\; -yI_x + xI_y,\; xI_x + yI_y,\; xI_x - yI_y,\; yI_x + xI_y) \qquad (10)$$

If the input image is being prefiltered (which, as discussed, must happen after warping,
i.e. after (10)), we can integrate by parts to reduce the prefiltered M vector to the form:

$$M^p = (I^p_x,\; I^p_y,\; -(yI)^p_x + (xI)^p_y,\; (xI)^p_x + (yI)^p_y - 2I^p,\; (xI)^p_x - (yI)^p_y,\; (yI)^p_x + (xI)^p_y) \qquad (11)$$

where $I^p \equiv p * I$, $(xI)^p_x \equiv p_x * (xI)$, etc., denote convolutions of $I$, $xI$, etc., against
the prefilter p and its derivatives $p_x$, $p_y$. The $-2I^p$ term in the s entry corrects for the
fact that prefiltering should happen after any infinitesimal scale change coded by M:
without this, we would effectively be comparing patches taken at different derivative
scales, and would thus overestimate the scale localization accuracy. If p is a Gaussian of
width $\sigma$, we can use (10) or (11) and the corresponding identities $(xI)^p = xI^p + \sigma^2 I^p_x$
or $(xI)^p_x = xI^p_x + \sigma^2 I^p_{xx} + I^p$ (from $(x - x')\,g(x - x') = -\sigma^2 g_x(x - x')$, etc.) to move
x, y outside the convolutions, reducing $M^p$ to:

$$M^p = (I^p_x,\; I^p_y,\; -yI^p_x + xI^p_y,\; xI^p_x + yI^p_y + \sigma^2 (I^p_{xx} + I^p_{yy}),\; xI^p_x - yI^p_y + \sigma^2 (I^p_{xx} - I^p_{yy}),\; yI^p_x + xI^p_y + 2\sigma^2 I^p_{xy}) \qquad (12)$$
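As a small illustration of the unfiltered motion vector in (10), the sketch below multiplies an image gradient by the 2×6 matrix T of (9) at one window-centred pixel. The function name `affine_motion_vector` and the sample values are assumptions made for the example.

```python
import numpy as np

def affine_motion_vector(Ix, Iy, x, y):
    """Return M = grad(I) . T for parameters (u, v, r, s, a, b) at pixel (x, y)."""
    T = np.array([[1.0, 0.0, -y, x,  x, y],
                  [0.0, 1.0,  x, y, -y, x]])
    return np.array([Ix, Iy]) @ T

# One pixel with gradient (2, -1) at window-centred position (3, 0.5).
M = affine_motion_vector(Ix=2.0, Iy=-1.0, x=3.0, y=0.5)

# Its contribution to the motion block of the scatter matrix is the outer
# product M^T M, weighted by the integration window value w at that pixel.
w = 0.8
contribution = w * np.outer(M, M)
```

Summing such contributions over all pixels of the integration window gives the motion block of the scatter matrix; restricting to a subset of the six columns gives the translation-only, similarity and other sub-models.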

Appearance model: Class-specific appearance models like [1,13] can include elaborate
models of appearance variation, but for generic keypoint detection we can only use simple
generic models designed to improve resistance to common types of local illumination
variations. Here, we allow for (at most) a scalar illumination shift, addition of a constant
spatial illumination gradient, and illumination rescaling. So our linear appearance model
is $I + L\lambda$ where L(x) is a subset of:

$$L(x) = (1,\; x,\; y,\; I(x)) \qquad (13)$$

As with M, the elements of L must be prefiltered, but I is just smoothed to $I^p$ and 1, x, y
typically have trivial convolutions (e.g., they are unchanged under Gaussian smoothing,
and hence generate a constant diagonal block $\mathrm{diag}(1, \sigma^2, \sigma^2)$ in S).

Putting It All Together: The main stages of keypoint detection are: (i) prefilter the input
image to produce the smoothed image and derivative estimates $I^p, I^p_x, I^p_y, I^p_{xx}, I^p_{xy}, I^p_{yy}$
needed for (12, 13); (ii) for each keypoint location x, form the outer product matrix of the
(desired components of the) combined L/M vector at all pixels in its window, and sum over
the window to produce the scatter matrix S(x) (7) (use window-centred coordinates for
x, y in (12, 13)); (iii) at each x, reduce S(x)
to find $C_{\mathrm{red}}(x)$, normalize by $C_0$, and find the smallest eigenvalue (saliency). Keypoints
are declared at points where the saliency has a dominant local maximum, i.e. is above
threshold and larger than at all other points within a suitable non-maximum-suppression
radius. For multiscale detection, processing is done within a pyramid and keypoints must
be maxima in both position and scale. As usual, one can estimate subpixel keypoint
location and scale by quadratic interpolation of the saliency field near its maximum.
But note that, as in the standard Förstner-Harris approach, keypoints do not necessarily
contain nameable features (corners, spots) that clearly mark their centres: they may
just be unstructured patches with locally maximal matching stability^.
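The three stages can be sketched for the simplest special case only: translation-only motion, no appearance parameters, a single scale and unit criterion matrix, which is essentially the classical minimum-eigenvalue detector. The helper names (`smooth`, `saliency_map`, `local_maxima`), the 3σ kernel truncation and the synthetic test image are assumptions of this sketch, not details taken from the text.

```python
import numpy as np

def gaussian_kernel(sigma):
    r = int(np.ceil(3 * sigma))               # truncate at 3 sigma (assumption)
    t = np.arange(-r, r + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def smooth(img, sigma):
    """Separable Gaussian smoothing using 1-D convolutions along each axis."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(np.convolve, 1, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")

def saliency_map(img, sigma=1.0):
    Ip = smooth(img, sigma)                   # (i) prefilter
    Iy, Ix = np.gradient(Ip)                  # derivatives (axis 0 = y, axis 1 = x)
    Sxx = smooth(Ix * Ix, sigma)              # (ii) windowed scatter matrix entries
    Sxy = smooth(Ix * Iy, sigma)
    Syy = smooth(Iy * Iy, sigma)
    half_trace = 0.5 * (Sxx + Syy)            # (iii) smallest eigenvalue, 2x2 closed form
    root = np.sqrt(0.25 * (Sxx - Syy) ** 2 + Sxy ** 2)
    return half_trace - root

def local_maxima(sal, threshold, radius):
    """Brute-force non-maximum suppression within a square neighbourhood."""
    points = []
    h, w = sal.shape
    for r in range(h):
        for c in range(w):
            if sal[r, c] <= threshold:
                continue
            patch = sal[max(r - radius, 0):r + radius + 1,
                        max(c - radius, 0):c + radius + 1]
            if sal[r, c] >= patch.max():
                points.append((r, c))
    return points

# Synthetic image: a bright square on a dark background.
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0
sal = saliency_map(img, sigma=1.0)
keypoints = local_maxima(sal, threshold=0.5 * sal.max(), radius=4)
```

On this image the saliency vanishes in flat regions and along straight edges, where the gradient has a single direction, and is concentrated near the four corners of the square. The richer models in the text replace the 2×2 matrix by the reduced, normalized scatter matrix of the chosen L/M components.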

^ If well-localized centres are needed, specialized locators exist for specific image structures such
as spots and corners (e.g. [8]), or more generally one could search for sharp (high-curvature)

[Figure 1 panels: (a) translation; (b) translation + scale; (c) translation + rotation; (d) similarity; (e) translation / offset; (f) translation / offset + gain; (g) translation / full; (h) similarity / full]

Fig. 1. Minimum-eigenvalue strength maps for a popular test image under various motion and
illumination models. The saliency differences are much larger than they seem: the maps have been
very strongly gamma compressed, normalized and inverted for better visibility. The prefilter and
integration windows had σ = 1 pixel, and α = 0. Criterion standard deviations were 1 pixel in
translation, 1 radian in rotation, √2 in scale, but these values are not critical.

When calculating S, instead of separate ab initio summation over each integration
window, one can also use image-wide convolution of quadratic ‘energies’ as in the standard
Förstner-Harris detector, but for the more complicated detectors there are many such
maps to be calculated (76 for the full 10-entry L/M model). See the extended version of
this paper for details.

In our current implementation, run times for the full 10-L/M-variable detector (which
is more than one would normally use in practice) are a factor of about 10 larger than for
the original two variable Förstner-Harris detector.

Relation to Zero Mean Matching: This common matching method compares two image
patches by first subtracting each patch's mean intensity, then summing the resulting
squared pixel differences. We can relate this to the simplest nonempty illumination
correction model, L = (1), whose reduced scatter matrix over a window w(x), normalized so
that $\int w\,dx = 1$, is:

$$C_{\mathrm{red}} = \int w\,M^\top M\,dx - \bar{M}^\top \bar{M} = \int w\,(M - \bar{M})^\top (M - \bar{M})\,dx, \qquad \bar{M} \equiv \int w\,M\,dx \Big/ \int w\,dx \qquad (14)$$

For the translation-only model, T is trivial, so the illumination correction simply has the
effect of subtracting from each image gradient its patch mean (c.f. (10)). If w changes
much more slowly than I, $\overline{\nabla I} \approx \nabla \bar{I}$ and hence $\nabla I - \overline{\nabla I} \approx \nabla (I - \bar{I})$, so this is
approximately the same as using the gradient of the bandpassed image $I - \bar{I}$. The standard
Förstner-Harris detector embodies least squares matching, not zero mean matching. It
is invariant to constant illumination shifts, but it does not subtract the gradient of the
mean $\nabla \bar{I}$ (or more correctly, the mean of the gradient $\overline{\nabla I}$) to discount the effects of
smooth local illumination gradients superimposed on the pattern being matched. It thus

and preferably isolated maxima of the minimum eigenvalue field or local saliency measure,
not just for high (but possibly broad) ones. For example, a minimum acceptable peak curvature
could be specified via a second criterion matrix.

Fig. 2. Mean predicted standard error (inverse square root of saliency / minimum eigenvalue in
normalized units) for template matching of keypoints under our motion and lighting models, for
the model’s top 100 keypoints on the Summer Palace image in Fig. 3.



systematically overestimates the geometric strength of keypoints in regions with strong 
illumination gradients, e.g. near the borders of smoothly shaded objects, or at the edges 
of shadows. 
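The link between the offset-only illumination model L = (1) and mean subtraction can be checked numerically. The sketch below assumes a window normalized to unit total weight and uses synthetic per-pixel gradients; all variable names and the random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pix = 200
# Translation-only model: M is the image gradient at each pixel. A common
# offset mimics a smooth illumination slope superimposed on the pattern.
M = rng.normal(size=(n_pix, 2)) + np.array([3.0, -1.0])
w = rng.uniform(0.5, 1.5, size=n_pix)
w /= w.sum()                                  # window normalized to unit weight

# Combined L/M vector with L = (1), and its windowed scatter matrix.
J = np.hstack([np.ones((n_pix, 1)), M])
S = (J * w[:, None]).T @ J
A, B, C = S[:1, :1], S[:1, 1:], S[1:, 1:]
C_red = C - B.T @ np.linalg.solve(A, B)       # Schur complement, as in (8)

# Direct zero-mean form: subtract the weighted mean gradient first.
M_bar = w @ M
centred = M - M_bar
C_centred = (centred * w[:, None]).T @ centred
```

In this sketch the difference C − C_red is the rank-one matrix built from the mean gradient, which is the extra "strength" that the uncorrected detector attributes to a smooth illumination slope.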



3 Experiments 



Fig. 1 shows that the saliency (minimum eigenvalue) map emphasizes different kinds
of image structures as the motion and illumination models are changed. Image (a) is
the original Förstner-Harris detector. Images (b), (c), (d) successively add scale, rotation
and scale + rotation motions, while images (e), (f), (g) adjust for illumination offset,
offset + gain, and offset + gain + spatial gradients. Note the dramatic extent to which
enforcing rotational stability in (a)→(c) and (b)→(d) eliminates the circular dots of
the calibration pattern. In comparison, enforcing scale stability in (a)→(b) and (c)→(d)
has more subtle effects, but note the general relative weakening of the points at the
summits of the towers between (a) and (b): straight-edged ‘corners’ are scale invariant,
and are therefore suppressed. Unfortunately, although ideal axis- and 45°-aligned corners
are strongly suppressed, it seems that aliasing and blurring effects destroy much of the
notional scale invariance of most other rectilinear corners, both in real images and in
non-axis-aligned ideal ones. We are currently working on this problem, which also reduces
the cross-scale performance of the standard Förstner-Harris detector.

Adding illumination invariance seems to have a relatively small effect in this example,
but note the general relative sharpening caused by including x and y illumination gradients
in (a), (e), (f) → (g). Points on the borders of intensity edges have enhanced gradients
owing to the slope alone, and this tends to make them fire preferentially despite the use of
the minimum-eigenvalue (most uncertain direction) criterion. Subtracting the mean local
intensity gradient reduces this and hence sharpens the results. However, a negative side
effect of including x, y gradients is that locally quadratic image patches (in particular
small dots and ridge edges) become much less well localized, as adding a slope to a
quadratic is equivalent to translating it.





[Figure 3 panels: (a) translation; (b) translation + rotation; (c) translation + scale; (d) similarity]



Allowing more general motions and/or quotienting out illumination variations always
reduces the precision of template matching. Fig. 2 shows the extent of this effect by
plotting the relative standard errors of template matching for our complete set of motion
and lighting models, where the matching for each model is performed on the model’s own
keypoints. There is a gradual increase in uncertainty as parameters are added, the final
uncertainty for a similarity transform modulo the full illumination model being about 2.5
times that of the original translation-only detector with no illumination correction.

Fig. 3 shows some examples of keypoints selected using the various different
motion/lighting models. The main observation is that different models often select different


[Figure 3 panels: (e) translation / offset; (f) translation / offset + xy; (g) translation / full]

Fig. 3. Examples of keypoints from the CMU and Summer Palace (Beijing) test images, under
various motion and illumination models. The prefilter and integration windows had σ = 2 pixels,
α = 0, and non-maximum suppression within 4 pixels radius and scale factor 1.8 was applied.
Note that, e.g., ‘affine’ means ‘resistant to small affine deformations’, not affine invariant in the
sense of [32,24,25].

keypoints, and more invariant models generate fewer of them, but beyond this it is difficult
to find easily interpretable systematic trends. As in the Förstner-Harris case, keypoints
are optimized for matching precision, not for easy interpretability in terms of idealized
image events.



4 Summary and Conclusions 

Summary: We have generalized the Förstner-Harris detector [7,9,15] to select keypoints
that provide repeatable scale and orientation, as well as repeatable position, over
re-detections, even in the face of simple local illumination changes. Keypoints are
selected to maximize a minimum-eigenvalue-based local stability criterion obtained from a
second order analysis of patch self-matching precision under affine image deformations,
compensated for linear illumination changes.

Future Work: The approach given here ensures accurate re-localizability (by inter-image
template matching) of keypoint image patches under various transformations, but
it does not always provide accurate ‘centres’ for them. To improve this, we would like to
characterize the stability and localization accuracy of the local maxima of the saliency
measure (minimum eigenvalue) under the given transformations. In other words, just
as we derived the local transformational-stability matrix $C_{\mathrm{red}}(x)$ for matching from the
scalar matching metric $Q(x)$, we need to derive a local transformational-stability matrix
for saliency from the scalar saliency metric. Only here, the saliency measure is already
based on matching stability, so a second level of analysis will be needed.

Abstract. Although inexact graph-matching is a problem of potentially
exponential complexity, the problem may be simplified by decomposing
the graphs to be matched into smaller subgraphs. If this is done, then the
process may be cast into a hierarchical framework and hence rendered suitable
for parallel computation. In this paper we describe a spectral method
which can be used to partition graphs into non-overlapping subgraphs.
In particular, we demonstrate how the Fiedler vector of the Laplacian
matrix can be used to decompose graphs into non-overlapping neighbourhoods
that can be used for the purposes of both matching and clustering.



1 Introduction 

Graph partitioning is concerned with grouping the vertices of a connected graph
into subsets so as to minimize the total cut weight [6]. The process is of central
importance in electronic circuit design, map coloring and scheduling [19]. How- 
ever, in this paper we are interested in the process since it provides a means 
by which the inexact graph-matching problem may be decomposed into a se- 
ries of simpler subgraph matching problems. As demonstrated by Messmer and 
Bunke [9], error-tolerant graph matching can be simplified using decomposition 
methods and reduced to a problem of subgraph indexing. Our aim in this paper 
is to explore whether spectral methods can be used to partition graphs in a 
stable manner for the purposes of matching by decomposition. 

Recently, there has been increased interest in the use of spectral graph theory 
for characterising the global structural properties of graphs. Spectral graph the- 
ory aims to summarise the structural properties of graphs using the eigenvectors 
of the adjacency matrix or the Laplacian matrix [2]. There are several examples 
of the application of spectral matching methods for grouping and matching in 
the computer vision literature. For instance, Umeyama has shown how graphs of 
the same size can be matched by performing singular value decomposition on the 
adjacency matrices [16]. Here the permutation matrix that brings the nodes of 
the graphs into correspondence is found by taking the outer product of the ma- 
trices of left eigenvectors for the two graphs. In related work Shapiro and Brady 
[13] have shown how to locate feature correspondences using the eigenvectors 
of a point-proximity weight matrix. These two methods fail when the graphs
being matched contain different numbers of nodes. However, this problem can
be overcome by using the apparatus of the EM algorithm [8,18]. More recently,
Shokoufandeh, Dickinson, Siddiqi and Zucker [15] have shown how graphs can
be retrieved efficiently using an indexing mechanism that maps the topological
structure of shock-trees into a low-dimensional vector space. Here the topological
structure is encoded by exploiting the interleaving property of the eigenvalues.

One of the most important spectral attributes of a graph is the Fiedler vector, 
i.e. the eigenvector associated with the second smallest eigenvalue of the Lapla- 
cian matrix. In a useful review, Mohar [11] has summarized some important 
applications of Laplace eigenvalues such as the max-cut problem, semidefinite 
programming and steady state random walks on Markov chains. More recently, 
Haemers [5] has explored the use of interlacing properties for the eigenvalues 
and has shown how these relate to the chromatic number, the diameter and 
the bandwidth of graphs. In the computer vision literature, Shi and Malik [14] 
have used the Fiedler vector to develop a recursive partition scheme and have 
applied this to image grouping and segmentation. The Fiedler vector may also
be used for the Minimum Linear Arrangement problem (MinLA) which involves
placing the nodes of a graph in a serial order which is suitable for the purposes
of visualisation [3].

An extension of the minimum linear arrangement problem is the seriation 
problem which involves finding a serial ordering of the nodes, which maximally 
preserves the edge connectivity. This is clearly a problem of exponential com- 
plexity. As a result approximate solution methods have been employed. These 
involve casting the problem in an optimization setting. Hence techniques such as 
simulated annealing and mean field annealing have been applied to the problem. 
It may also be formulated using semidefinite programming, which is a technique 
closely akin to spectral graph theory since it relies on eigenvector methods. How- 
ever, recently a graph-spectral solution has been found to the problem. Atkins, 
Boman and Hendrickson [1] have shown how to use the Fiedler eigenvector of
the Laplacian matrix to sequence relational data. The method has been success- 
fully applied to the consecutive ones problem and a number of DNA sequencing 
tasks. There is an obvious parallel between this method and steady state ran- 
dom walks on graphs, which can be located using the leading eigenvector of the 
Markov chain transition probability matrix. However, in the case of a random 
walk the path is not guaranteed to encourage edge connectivity. The spectral 
seriation method, on the other hand, does impose edge connectivity constraints 
on the recovered path. 

The aim in this paper is to consider whether the partitions delivered by the 
Fiedler vector can be used to simplify the graph-matching problem. We focus 
on two problems. The first of these is to use the Fiedler vector to decompose 
graphs by partitioning them into structural units. Our aim is to explore whether 
the partitions are stable under structural error, and in particular whether they 
can be used for the purposes of graph-matching. The second problem studied 
is whether the partitions can be used to simplify the graphs in a hierarchical 
manner. Here we construct a graph in which the nodes are the partitions and 
the edges indicate whether the partitions are connected by edges in the original 
graph. This spectral construction can be applied recursively to provide a hierarchy
of simplified graphs. We show that the simplified graphs can be used for
efficient and reliable clustering.

2 Laplacian Matrix and Fiedler Vector

Consider the unweighted graph G = (V, E) where V is the set of nodes and E
is the set of edges. The adjacency matrix of the graph is A, and has elements

$$A(i,j) = \begin{cases} 1 & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

The weighted adjacency matrix is denoted by W.

The degree matrix of the graph is the diagonal matrix $D = \mathrm{diag}(\deg(i);\ i \in V)$
where the degree is the row-sum of the adjacency matrix $\deg(i) = \sum_{j \in V} A(i,j)$.
With these ingredients the Laplacian matrix L = D − A has elements

$$L(i,j) = \begin{cases} \sum_{(i,k) \in E} A(i,k) & \text{if } i = j \\ -A(i,j) & \text{if } i \ne j \text{ and } (i,j) \in E \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$



The Laplacian matrix has a number of important properties. It is symmetric
and positive semidefinite. The eigenvector $(1, 1, \ldots, 1)^\top$ corresponds to the
trivial zero eigenvalue. The number of connected components of the graph is equal
to the multiplicity of this smallest eigenvalue, so if the graph is connected the
zero eigenvalue is simple and all other eigenvalues are positive. If we arrange all
the eigenvalues from the smallest to the largest, i.e.
$0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$, the most important are the largest eigenvalue $\lambda_{\max}$ and
the second smallest eigenvalue $\lambda_2$, whose corresponding eigenvector is referred
to as the Fiedler vector [4].
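These definitions can be made concrete with a short sketch that builds L = D − A and extracts the Fiedler vector using a dense symmetric eigensolver. The example graph (two triangles joined by a single edge) and the function names are illustrative assumptions.

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - A

def fiedler_vector(A):
    """Second smallest Laplacian eigenvalue and its eigenvector."""
    vals, vecs = np.linalg.eigh(laplacian(A))   # eigenvalues in ascending order
    return vals[1], vecs[:, 1]

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

L = laplacian(A)
lam2, x = fiedler_vector(A)
side = x > 0      # the sign pattern separates the two triangles
```

For this graph the signs of the Fiedler components split the nodes into the two triangles, which is the behaviour exploited by recursive spectral partitioning schemes.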

Our aim is to decompose the graph into non-overlapping neighbourhoods
using a path-based seriation method. The aim is to find a path sequence for the
nodes in the graph using a permutation $\pi$. The permutation gives the order of
the nodes in the sequence. The sequence is such that the elements of the edge
weight matrix W decrease as the path is traversed. Hence, if $\pi(i) < \pi(j) < \pi(k)$,
then $W(i,j) > W(i,k)$ and $W(j,k) > W(i,k)$. This behaviour can be captured
using the penalty function

$$g(\pi) = \sum_{i=1}^{|V|} \sum_{j=1}^{|V|} W(i,j)\,(\pi(i) - \pi(j))^2$$

By minimizing $g(\pi)$ it is possible to find the permutation that minimizes
the difference in edge weight between adjacent nodes in the path, and this in
turn sorts the edge weights into magnitude order. Unfortunately, minimizing
$g(\pi)$ is potentially NP complete due to the combinatorial nature of the discrete
permutation $\pi$. To overcome this problem, a relaxed solution is sought that
approximates the structure of $g(\pi)$ using a vector $x = (x_1, x_2, \ldots)$ of continuous
variables $x_i$. Hence, the penalty function considered is

$$g(x) = \sum_{i=1}^{|V|} \sum_{j=1}^{|V|} W(i,j)\,(x_i - x_j)^2$$

The value of $g(x)$ does not change if a constant amount is added to each of the
components $x_i$. Hence, the minimization problem must be subject to constraints
on the components of the vector x. The constraints are that

$$\sum_{i=1}^{|V|} x_i^2 = 1 \quad \text{and} \quad \sum_{i=1}^{|V|} x_i = 0 \qquad (3)$$

The solution to this relaxed problem may be obtained from the Laplacian matrix.
If $e = (1, 1, 1, \ldots, 1)^\top$ is the all-ones vector, then the solution to the minimization
problem is the vector

$$x_* = \arg\min_{x^\top e = 0,\ x^\top x = 1} x^\top L x = \arg\min_{x^\top e = 0,\ x^\top x = 1} \sum_{i > j} W(i,j)\,(x_i - x_j)^2$$

When W is positive definite, then the solution is the Fiedler vector, i.e.
the vector associated with the smallest non-zero eigenvalue of L. In fact, the
associated eigenvalue is the minimum of the Rayleigh quotient

$$\lambda = \min_{x^\top e = 0} \frac{x^\top L x}{x^\top x}$$
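The relaxation can be illustrated on a tiny example where the combinatorial optimum is still computable by brute force. The sketch assumes a path graph with shuffled node labels and a Laplacian built from W; the helper names `penalty` and `spectral_order` are hypothetical.

```python
import numpy as np
from itertools import permutations

def penalty(W, pos):
    """g = sum_ij W(i,j) (pos_i - pos_j)^2 for node positions pos."""
    pos = np.asarray(pos, dtype=float)
    diff = pos[:, None] - pos[None, :]
    return float((W * diff ** 2).sum())

def spectral_order(W):
    """Fiedler vector of L = D - W and the node order obtained by sorting it."""
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)
    x = vecs[:, 1]
    return x, np.argsort(x)

# A path graph whose nodes are visited in the order 3 - 0 - 4 - 1 - 2.
path = [3, 0, 4, 1, 2]
n = len(path)
W = np.zeros((n, n))
for i, j in zip(path[:-1], path[1:]):
    W[i, j] = W[j, i] = 1.0

x, order = spectral_order(W)
pos = np.empty(n, dtype=int)
pos[order] = np.arange(n)          # pos[i] is the rank pi(i) of node i

# Exhaustive search over all 5! permutations, feasible only for tiny graphs.
best = min(penalty(W, p) for p in permutations(range(n)))
```

Here sorting the Fiedler components recovers the path (possibly reversed), and the resulting permutation attains the brute-force minimum of the discrete penalty. For general graphs the relaxed solution is only an approximation to the combinatorial optimum.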



3 Graph Partition 

The aim in this paper is to use the Fiedler vector to partition graphs into non- 
overlapping structural units and to use the structural units generated by this 
decomposition for the purposes of graph-matching and graph-simplification. 



3.1 Decomposition 

The neighbourhood of the node i consists of its center node, together with
its immediate neighbors connected by edges in the graph, i.e., $N_i = \{i\} \cup
\{u;\ (i,u) \in E\}$. An illustration is provided in Figure 1, which shows a graph
with two of its neighbourhoods highlighted. Hence, each neighbourhood consists
of a center node and immediate neighbors of the center node, i.e. $\hat{N}_i = N_i \setminus \{i\}$.

The problem addressed here is how to partition the graph into a set of
non-overlapping neighbourhoods using the node order defined by the Fiedler vector.
Our idea is to assign to each node a measure of significance as the centre of a
neighbourhood. We then traverse the path defined by the Fiedler vector selecting
the centre-nodes on the basis of this measure.

We commence by assigning weights to the nodes on the basis of the rank-order
of their component in the Fiedler vector. Let $\Gamma = \langle j_1, j_2, j_3, \ldots \rangle$ be the
rank-order of the nodes as defined by the Fiedler vector so that the permutation
satisfies the condition $\pi(j_1) < \pi(j_2) < \pi(j_3) < \ldots$ and the components of the
Fiedler vector follow the condition $x_{j_1} > x_{j_2} > x_{j_3} > \ldots$. We assign weights to
the nodes based on their rank order in the permutation. The weight assigned to
the node $i \in V$ is $w_i = \mathrm{Rank}(\pi(i))$. With this weighted graph in hand, we can
gauge the significance of each node using the following score function:

$$R_i = \alpha\,(\deg(i) + |N_i \cap P|) + \frac{\beta}{w_i} \qquad (4)$$

where P is the set of nodes on the perimeter of the graph. The first term depends
on the degree of the node and its proximity to the perimeter. Hence, it will
sort nodes according to their distance from the perimeter. This will allow us to
partition nodes from the outer layer first and then work inwards. The second
term ensures that the first ranked nodes in the Fiedler vector are visited first.

We use the score function to locate the non-overlapping neighbourhoods of
the graph G. We traverse this list until we find a node $k_1$ which is not in
the perimeter, i.e. $k_1 \notin P$, and whose score is not exceeded by those of its
neighbours $\hat{N}_{k_1} = N_{k_1} \setminus \{k_1\}$, i.e. $R_{k_1} = \max_{i \in \{k_1\} \cup \hat{N}_{k_1}} R_i$. When this condition
is satisfied, then the node $k_1$ together with its neighbours $\hat{N}_{k_1}$ represent the first
neighbourhood. The set of nodes $N_{k_1} = \{k_1\} \cup \hat{N}_{k_1}$ is appended to a list T that
tracks the set of nodes assigned to the neighbourhoods. This process is repeated
for all the nodes which have not yet been assigned to a neighbourhood, i.e.
$R = \Gamma - T$, where $\Gamma$ is the rank-ordered node list. The procedure
terminates when all the nodes of the graph have been assigned to non-overlapping
neighbourhoods.
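The traversal can be sketched as a greedy loop over the rank-ordered node list. This is an interpretation under stated assumptions rather than the authors' implementation: the perimeter set and the Fiedler ordering are supplied by the caller, neighbourhoods are restricted to nodes not yet assigned, and a fallback (take the best-scoring remaining node) is added so that the loop always terminates. All names and the example graph are hypothetical.

```python
def partition_neighbourhoods(adj, order, perimeter, alpha=1.0, beta=1.0):
    """Greedy decomposition into non-overlapping neighbourhoods.

    adj: dict mapping each node to the set of its neighbours.
    order: nodes sorted by their Fiedler vector component (rank order).
    perimeter: set of perimeter nodes.
    """
    weight = {node: rank for rank, node in enumerate(order, start=1)}
    score = {
        i: alpha * (len(adj[i]) + len((adj[i] | {i}) & perimeter)) + beta / weight[i]
        for i in adj
    }
    assigned, parts = set(), []
    while len(assigned) < len(adj):
        free = [i for i in order if i not in assigned]
        centre = None
        for k in free:
            rivals = [j for j in adj[k] if j not in assigned]
            if k not in perimeter and all(score[k] >= score[j] for j in rivals):
                centre = k
                break
        if centre is None:
            # Fallback (assumption of this sketch): no interior candidate left.
            centre = max(free, key=lambda i: score[i])
        members = {centre} | {j for j in adj[centre] if j not in assigned}
        parts.append((centre, members))
        assigned |= members
    return parts

# Two hubs (0 and 4) linked through node 3; leaves form the perimeter.
adj = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0, 4}, 4: {3, 5, 6}, 5: {4}, 6: {4}}
order = [1, 2, 0, 3, 4, 5, 6]          # hypothetical Fiedler rank order
parts = partition_neighbourhoods(adj, order, perimeter={1, 2, 5, 6})
```

Each returned pair holds a centre node and the members of its neighbourhood; by construction the member sets are disjoint and together cover every node.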



4 Matching 

We match the graphs using the non-overlapping neighbourhoods detected using
the Fiedler vector. Consider a data graph $G_D$ which is to be matched onto a
model graph $G_M$. The state of correspondence match can be represented by the
function $f : V_D \to V_M \cup \{\phi\}$ from the node-set of the data graph onto the
node-set of the model graph, where the node-set of the model graph is augmented by
adding a NULL label, $\phi$, to allow for unmatchable nodes in the data graph. Our
score function for the match is the average over the matching probabilities for
the set of neighbourhoods of the graph

$$Q_{DM}(f) = \frac{1}{|V_D|} \sum_{i \in V_D} P(F_i) \qquad (5)$$

where $F_i = (f(u_0), f(u_1), \ldots)$ denotes the relational image of the
neighbourhood $N_i^D$ in $G_D$ under the matching function f.

We use the Bayes rule to compute the matching probability over a set of
legal structure-preserving mappings between the data and model graphs. The
set of mappings is compiled by considering the cyclic permutations of the
peripheral nodes about the centre node of the neighbourhood. The set of feasible
mappings generated in this way is denoted by $\Theta_i = \{S\}$, which consists of
structure-preserving mappings of the form $S = (s_0, s_1, \ldots, s_{|N_i^D|})$, where
$s_u \in \{j\} \cup \{v;\ (j,v) \in E_M\} \cup \{\phi\}$ is either one of the node-labels drawn from the
model graph neighbourhood or the null-label, and $u \in N_i^D$ is one of the
node-labels drawn from the data graph neighbourhood.

With the structure preserving mappings to hand we use the Bayes formula
to compute the matching probability, $P(F_i)$. This is done by expanding over the
set of structure preserving mappings $\Theta_i$ in the following manner

$$P(F_i) = \sum_{S \in \Theta_i} P(F_i | S)\, P(S) \qquad (6)$$

We assume a uniform distribution of probability over the structure preserving
mappings and write $P(S) = 1/|\Theta_i|$. The conditional matching probability $P(F_i | S)$
is determined by comparing every assigned match $f(u)$ in the configuration $F_i$
with the corresponding item $s_u$ in the structure preserving mapping S.
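The expansion in (6) can be sketched in a few lines once the cyclic mappings are enumerated. The conditional model used here, an independent per-position error with probability `p_err`, is a deliberately simple stand-in chosen for illustration and is not the edit-based model developed in the text; the function names and labels are likewise hypothetical.

```python
def cyclic_mappings(centre, periphery):
    """Centre node fixed, peripheral nodes cyclically rotated."""
    n = len(periphery)
    return [(centre, *periphery[k:], *periphery[:k]) for k in range(n)]

def toy_likelihood(F, S, p_err=0.1):
    """Hypothetical P(F | S): each position agrees with probability 1 - p_err."""
    prob = 1.0
    for f_u, s_u in zip(F, S):
        prob *= (1.0 - p_err) if f_u == s_u else p_err
    return prob

def matching_probability(F, mappings, likelihood):
    """P(F) = sum over S of P(F | S) P(S) with a uniform prior over mappings."""
    prior = 1.0 / len(mappings)
    return sum(likelihood(F, S) * prior for S in mappings)

# Model neighbourhood: centre 'a' with peripheral nodes 'b', 'c', 'd'.
mappings = cyclic_mappings("a", ["b", "c", "d"])
F = ("a", "c", "d", "b")               # relational image of a data neighbourhood
p = matching_probability(F, mappings, toy_likelihood)
```

With three peripheral nodes there are three cyclic mappings, consistent with a mapping set whose size is one less than the neighbourhood size. The configuration F agrees exactly with one of them, and that mapping dominates the sum.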

4.1 Edit Distance 

To model the structural differences in the neighbourhoods, we use the Leven- 
shtein or string edit distance [7,17,12]. This models structural error by consider- 
ing insertions and deletions, in addition to relabelling. In what follows, the set of 
structure preserving mappings Of which contains only cyclic permutations and 
whose size is therefore equal to \N[’\ — 1. 

Let $X$ and $Y$ be two strings of symbols drawn from an alphabet $\Sigma$. We wish
to convert $X$ to $Y$ via an ordered sequence of operations such that the cost
associated with the sequence is minimal. The original string to string correction
algorithm defined elementary edit operations $(a, b) \neq (\epsilon, \epsilon)$, where $a$ and $b$ are
symbols from the two strings or the NULL symbol $\epsilon$. Thus, changing symbol $x$
to $y$ is denoted by $(x, y)$, inserting $y$ is denoted $(\epsilon, y)$, and deleting $x$ is denoted
$(x, \epsilon)$. A sequence of such operations which transforms $X$ into $Y$ is known as
an edit transformation and denoted $\Delta = \langle \delta_1, \ldots, \delta_{|\Delta|} \rangle$. Elementary costs are
assigned by an elementary weighting function $\gamma : (\Sigma \cup \{\epsilon\}) \times (\Sigma \cup \{\epsilon\}) \mapsto \Re$; the
cost of an edit transformation, $C(\Delta)$, is the sum of its elementary costs. The
edit distance between $X$ and $Y$ is defined as

$$d(X, Y) = \min\{C(\Delta) \mid \Delta \text{ transforms } X \text{ to } Y\} \qquad (7)$$

In [10], Marzal and Vidal introduced the notion of an edit path, which is a
sequence of ordered pairs of positions in $X$ and $Y$ such that the path monotonically
traverses the edit matrix of $X$ and $Y$ from $(0, 0)$ to $(|X|, |Y|)$.

Essentially, the transition from one point in the path to the next is equivalent
to an elementary edit operation: $(a, b) \to (a+1, b)$ corresponds to deletion of the
symbol in $X$ at position $a$. Similarly, $(a, b) \to (a, b+1)$ corresponds to insertion
of the symbol at position $b$ in $Y$. The transition $(a, b) \to (a+1, b+1)$ corresponds
to a change from $X(a)$ to $Y(b)$. Thus, the cost of an edit path, $C(P)$,
can be determined by summing the elementary weights of the edit operations
implied by the path, and the edit distance becomes

$$d(X, Y) = \min\{C(P) \mid P \text{ is an edit path from } X \text{ to } Y\} \qquad (8)$$
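
The edit-matrix traversal described above can be sketched in Python as follows. This is a minimal illustration, assuming unit cost for insertion, deletion and change and zero cost for an identity; the function name `edit_path` and the use of `None` for the null symbol are choices made here for illustration and do not come from the original text.

```python
def edit_path(X, Y, cost=1):
    """Return (distance, operations) for converting sequence X into sequence Y.

    Each operation is a pair (x, y); None plays the role of the null symbol,
    so (x, None) is a deletion and (None, y) is an insertion.
    """
    n, m = len(X), len(Y)
    # d[a][b] = minimal cost of converting X[:a] into Y[:b]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for a in range(1, n + 1):
        d[a][0] = a * cost
    for b in range(1, m + 1):
        d[0][b] = b * cost
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            change = 0 if X[a - 1] == Y[b - 1] else cost
            d[a][b] = min(d[a - 1][b] + cost,        # deletion
                          d[a][b - 1] + cost,        # insertion
                          d[a - 1][b - 1] + change)  # change or identity

    # Backtrack through the edit matrix from (|X|, |Y|) to (0, 0).
    ops = []
    a, b = n, m
    while a > 0 or b > 0:
        if a > 0 and b > 0:
            change = 0 if X[a - 1] == Y[b - 1] else cost
            if d[a][b] == d[a - 1][b - 1] + change:
                ops.append((X[a - 1], Y[b - 1]))
                a, b = a - 1, b - 1
                continue
        if a > 0 and d[a][b] == d[a - 1][b] + cost:
            ops.append((X[a - 1], None))
            a -= 1
        else:
            ops.append((None, Y[b - 1]))
            b -= 1
    ops.reverse()
    return d[n][m], ops
```

With unit costs, the number of non-identity operations on the recovered path equals the edit distance, which is the quantity the matching probabilities rely on.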



4.2 Matching Probabilities 

If we replace $X$ and $Y$ by a structure preserving mapping, $S_i$, and the image
of a data graph neighbourhood under the match, $F_j$, we can see that $F_j$ could
have arisen from $S_i$ through the action of a memoryless error process, statistically
independent of position (since the errors that "transformed" $S_i$ to $F_j$ could have
occurred in any order). So we can factorize the conditional probability in (6) over
the elementary operations implied by the edit path $P^*_{F_j,S_i}$

$$P(F_j|S_i) = \prod_{(f(u),v) \in P^*_{F_j,S_i}} P(f(u)|v) \qquad (9)$$



where $(f(u), v)$ is an insertion, a deletion, a change or an identity operation
implied by the edit path $P^*_{F_j,S_i}$ between the neighbourhood $F_j$ and the unpadded
structure preserving mapping $S_i$. The role of the edit distance here is to identify
each operation rather than to compute the total cost. We can trace every single
operation by backtracking through the edit matrix. For simplicity, we assume that
all non-identity edit operations have the same cost, for example 1. This choice does
not influence the probability, because it is the probabilities of the transitions in the
path that contribute to the matching prior, not the edit weights themselves,
although the weights do determine the magnitude of the minimum cost.






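$$\gamma(f(u), v) = \begin{cases} 0 & \text{if } (f(u), v) \text{ is an identity} \\ 1 & \text{otherwise} \end{cases} \qquad (10)$$

So, the probability for the edit operation given to each pair is:

$$P(f(u)|v) = \begin{cases} (1 - P_e) & \text{if } (f(u), v) \text{ is an identity} \\ P_e & \text{otherwise} \end{cases} \qquad (11)$$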



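If we define the number of non-identity transformations in the edit path to be
$W(P^*_{F_j,S_i})$, the matching probability of $F_j$ can be given:

$$P(F_j) = \frac{K_{F_j}}{|\Theta|} \sum_{S_i \in \Theta} \exp\left[-k_e W(P^*_{F_j,S_i})\right] \qquad (12)$$

where $K_{F_j} = (1 - P_e)^{|F_j|}$ and $k_e = \ln\frac{(1 - P_e)}{P_e}$.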
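
As a hedged numerical check of how the per-operation probabilities combine, the sketch below compares the product form (identities contribute $1 - P_e$, every other operation contributes $P_e$) with the exponential form averaged over a uniformly weighted set of mappings. It assumes that every edit path contains the same number of operations `num_ops`; that assumption, and the function names, are introduced here for illustration only.

```python
import math


def mapping_likelihood(num_ops, num_nonidentity, p_e):
    """Product form: (1 - p_e) per identity, p_e per non-identity operation."""
    return (1 - p_e) ** (num_ops - num_nonidentity) * p_e ** num_nonidentity


def matching_probability(nonidentity_counts, num_ops, p_e):
    """Exponential form, averaged over mappings with a uniform prior.

    nonidentity_counts holds one count W per structure preserving mapping.
    Assumes (for illustration) that all edit paths have num_ops operations.
    """
    k_e = math.log((1 - p_e) / p_e)
    K = (1 - p_e) ** num_ops
    total = sum(math.exp(-k_e * w) for w in nonidentity_counts)
    return K / len(nonidentity_counts) * total
```

The two forms agree because $(1 - P_e)^{n - W} P_e^{W} = (1 - P_e)^{n} \exp(-W \ln\frac{1 - P_e}{P_e})$ for a path of $n$ operations.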
5 Hierarchical Simplification

The neighbourhoods extracted using the Fiedler vector may also be used to 
perform hierarchical graph simplification. 



5.1 Partition Arrangements 

Our simplification process proceeds as follows. We create a new graph in which
each neighbourhood $N_i = \{i\} \cup \{u; (i,u) \in E\}$ is represented by a node. In
practice this is done by eliminating those nodes which are not the center nodes
of the neighbourhoods, $\hat{N}_i = N_i \setminus \{i\}$. In other words, we select the center node of
each neighbourhood to form the node-set for the next level representation. The node set
is given by $\hat{V} = \{N_1 \setminus \hat{N}_1, N_2 \setminus \hat{N}_2, \ldots, N_n \setminus \hat{N}_n\}$. Our next step is to construct
the edge-set for the simplified graph. We construct an edge between two nodes if
there is a common edge contained within their associated neighbourhoods. The
condition for the nodes $i \in \hat{V}$ and $j \in \hat{V}$ to form an edge in the simplified graph
$\hat{G} = (\hat{V}, \hat{E})$ is $(i,j) \in \hat{E} \Leftrightarrow |N_i \cap N_j| \geq 2$.
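
The construction of the simplified graph can be illustrated with a short Python sketch. It takes the neighbourhoods as given (a dictionary from each center node to the set of nodes in its neighbourhood) and reads the edge condition as "at least two shared nodes", on the reasoning that a shared edge has two endpoints; that reading, the `min_shared` parameter and the function name are assumptions made here.

```python
from itertools import combinations


def simplify_graph(neighbourhoods, min_shared=2):
    """Build the next-level graph from a dict {center: set_of_nodes}.

    The center nodes become the node set; two centers are joined when their
    neighbourhoods share at least min_shared nodes (assumed threshold).
    """
    centres = sorted(neighbourhoods)
    edges = [(i, j) for i, j in combinations(centres, 2)
             if len(neighbourhoods[i] & neighbourhoods[j]) >= min_shared]
    return centres, edges
```

For example, neighbourhoods `{1: {1, 2, 3}, 4: {4, 2, 3}, 6: {6, 3, 7}}` yield a single edge between centers 1 and 4, since only those two share a pair of nodes.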



5.2 Clustering 

To provide an illustration of the usefulness of the simplifications provided by the 
Fiedler vector, we focus on the problem of graph clustering. The aim here is to 
investigate whether the simplified graphs preserve the pattern space distribution 
of the original graphs. There are a number of ways in which we could undertake 
this study. However, here we use a simple graph-spectral clustering method, 
which is in keeping with the overall philosophy of this paper. 

Suppose that we aim to cluster the set of $M$ graphs $\{G_1, \ldots, G_k, \ldots, G_M\}$. We
commence by performing the spectral decomposition $L_k = \Phi_k \Lambda_k \Phi_k^T$ on the Laplacian
matrix $L_k$ for the graph indexed $k$, where $\Lambda_k = \mathrm{diag}(\lambda_k^1, \lambda_k^2, \ldots)$ is the diagonal
matrix of eigenvalues and $\Phi_k$ is a matrix with eigenvectors as columns. For
the graph $G_k$, we construct a vector $B_k = (\lambda_k^1, \lambda_k^2, \ldots, \lambda_k^m)^T$ from the leading $m$
eigenvalues. We can visualize the distribution of graphs by performing multidimensional
scaling (MDS) on the matrix of distances $d_{k_1,k_2}$ between graphs. These
distances can be computed using either the edit distance technique of
the previous section, where $d_{k_1,k_2} = -\ln Q_{k_1,k_2}$, or the spectral features,
where $d_{k_1,k_2} = (B_{k_1} - B_{k_2})^T (B_{k_1} - B_{k_2})$.
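
A minimal NumPy sketch of this pipeline is given below: Laplacian eigenvalues as features, squared Euclidean distances between feature vectors, and classical MDS on the resulting matrix. It assumes the unnormalized Laplacian $L = D - A$, takes "leading" to mean the largest eigenvalues, and zero-pads graphs with fewer than $m$ nodes; these choices and all function names are assumptions for illustration.

```python
import numpy as np


def laplacian_spectrum(adjacency, m):
    """Feature vector of the m largest Laplacian eigenvalues (zero-padded)."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A          # unnormalized Laplacian (assumed)
    eigvals = np.linalg.eigvalsh(L)         # ascending order
    lead = eigvals[::-1][:m]                # m largest eigenvalues
    return np.pad(lead, (0, m - lead.size))


def spectral_distances(features):
    """Matrix of (B_k1 - B_k2)^T (B_k1 - B_k2) for all pairs of graphs."""
    B = np.asarray(features, dtype=float)
    diff = B[:, None, :] - B[None, :, :]
    return np.einsum('ijk,ijk->ij', diff, diff)


def classical_mds(sq_dist, dims=2):
    """Classical MDS embedding from a matrix of squared distances."""
    sq_dist = np.asarray(sq_dist, dtype=float)
    n = sq_dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    gram = -0.5 * J @ sq_dist @ J
    w, V = np.linalg.eigh(gram)
    idx = np.argsort(w)[::-1][:dims]        # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

Because the spectral distance is already a squared Euclidean distance, it can be passed to `classical_mds` directly; an edit-distance matrix would need to be squared first if it is to be treated the same way.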

6 Experiments 

The aims in this section are twofold. First, we aim to illustrate that the neigh- 
bourhoods delivered by the Fiedler vector form useful structural units for com- 
puting edit distance. Second, we aim to illustrate that the simplification proce- 
dure results in a stable distribution of graphs in pattern-space. 
6.1 Real-World Data

The data used in our study is furnished by a sequence of views of a model
house taken from different camera directions. In order to convert the images into
abstract graphs for matching, we extract point features using a corner detector.
Our graphs are the Delaunay triangulations of the corner-features.

We have matched the first image to each of the subsequent images in the
sequence by using the edit distance method outlined earlier in this paper. The
results are compared with those obtained using the method of Luo and Hancock
[8] in Table 1. This table contains the number of detected corners to be
matched, the number of correct correspondences, the number of missed corners
and the number of mismatched corners.

Figure 2 shows the correct correspondence rate as a function of view
difference for the two methods, based on the data in Table 1. From the
results, it is clear that our new method degrades gradually and outperforms [8]
when the difference in viewing angle is large. Even in the worst case, our method
has a correct correspondence rate of 36%.

Table 1. Correspondence allocation results and comparison with the EM method.

Method         House index    0    1    2    3    4    5    6    7    8    9
               Corners       30   32   32   30   30   32   30   30   30   31
EM [8]         Correct        -   29   26   24   17   13   11    5    3    0
               False          -    0    2    3    8   11   12   15   19   24
               Missed         -    1    2    3    5    6    7   10    8    6
Edit Distance  Correct        -   26    ?   20   19   17   14   11   13   11
               False          -    3    ?    ?   11   12   16   15   17   19
               Missed         -    1    1    2    0    1    0    4    0    0

Fig. 2. Comparison of results (horizontal axis: view difference)

6.2 Graph Clustering 

Our graph clustering experiments are performed with three different sequences
of model houses. In Figure 3 the two panels show the distances
$d(k_1,k_2) = (B_{k_1} - B_{k_2})^T (B_{k_1} - B_{k_2})$ between the vectors of eigenvalues for the graphs
indexed $k_1$ and $k_2$. The left panel is for the original graph and the right panel is
for the simplified graph. It is clear that the simplification process has preserved
much of the structure in the distance plot. For instance, the three sequences
are clearly visible as blocks in the panels. Figure 4 shows a scatter plot of the
distance between the simplified graphs (y-axis) as a function of the distance
between the original graphs. Although there is considerable dispersion, there is
an underlying linear trend.

Figures 5 and 6 repeat the distance matrices and the scatter plot using edit
distance rather than the L2 norm for the spectral feature vectors. Again, there
is a clear block structure. However, the dispersion in the scatter plot is greater.
To take this study one step further, in Figures 7 and 8 we show the result of
performing MDS on the distances for both the edit distance and the spectral
feature vector. In both cases the different views of the houses fall into distinct
regions of the plot. Moreover, the reduction process does not destroy the cluster
structure.

Fig. 3. Pairwise spectral graph distance; (left) original graph, (right) reduced graph

Fig. 4. Scatter plot for the original graph and reduced graph pairwise distance

Fig. 5. Graph edit distance; (left) original graph, (right) reduced graph

Fig. 6. Scatter plots for the original graph and reduced graph edit distance

7 Conclusions 

In this paper, we have used the Fiedler vector of the Laplacian matrix to partition
the nodes of a graph into structural units for the purposes of matching. This
allows us to decompose the problem of matching the graphs into that of matching
structural subunits. We investigate the matching of the structural subunits
using an edit distance method. The partitioning method is sufficiently stable under
structural error that accuracy of match is not sacrificed. Our motivation in
undertaking this study is to use the partitions to develop a hierarchical matching
method. The aim is to construct a graph that represents the arrangement of
the partitions. By first matching the partition arrangement graphs, we provide
constraints on the matching of the individual partitions.

Fig. 7. MDS for the original graph: (left) edit distance, (right) spectral feature vector

Fig. 8. MDS for the reduced graph: (left) edit distance, (right) spectral feature vector

Abstract. We introduce a novel approach to cerebral white matter
connectivity mapping from diffusion tensor MRI. DT-MRI is the
only non-invasive technique capable of probing and quantifying the
anisotropic diffusion of water molecules in biological tissues. We address
the problem of consistent neural fiber reconstruction in areas of complex
diffusion profiles with potentially multiple fiber orientations. Our
method relies on a global modeling of the acquired MRI volume as a
Riemannian manifold $M$ and proceeds in four major steps. First, we establish
the link between Brownian motion and diffusion MRI by using the
Laplace-Beltrami operator on $M$. We then show how the sole knowledge
of the diffusion properties of water molecules on $M$ is sufficient to
infer its geometry: there exists a direct mapping between the diffusion
tensor and the metric of $M$. Next, having access to that metric, we propose
a novel level set formulation scheme to approximate the distance
function related to a radial Brownian motion on $M$. Finally, a rigorous
numerical scheme using the exponential map is derived to estimate the
geodesics of $M$, seen as the diffusion paths of water molecules. Numerical
experiments conducted on synthetic and real diffusion MRI datasets
illustrate the potential of this global approach.



1 Introduction 

Diffusion imaging is a magnetic resonance imaging technique introduced in the
mid 1980s [1], [2] which provides a very sensitive probe of biological tissue
architecture. Although this method suffered, in its very first years, from severe
technical constraints such as acquisition time or motion sensitivity, it is now
taking an increasingly important place with new acquisition modalities such as
ultrafast echo-planar methods. In order to understand the neural fiber bundle
architecture, anatomists used to perform cerebral dissection, strychnine or chemical
markers neuronography [3]. As of today, diffusion MRI is the only non-invasive
technique capable of probing and quantifying the anisotropic diffusion
of water molecules in tissues like brain or muscles. As we will see in the following,
the diffusion phenomenon is a macroscopic physical process resulting from the
permanent Brownian motion of molecules and shows how molecules tend to move
from high to low concentration areas over distances of about 10 to 15 μm during
typical times of 50 to 100 ms. The key concept that is of primary importance
for diffusion imaging is that diffusion in biological tissues reflects their structure
and their architecture at a microscopic scale. For instance, Brownian motion is
strongly influenced by the structure of tissues such as cerebral white matter or the
annulus fibrosus of inter-vertebral discs. Measuring, at each voxel, that very same motion along
a number of sampling directions (at least 6, up to several hundreds) provides an
exquisite insight into the local orientation of fibers and is known as diffusion-weighted
imaging. In 1994, Basser et al. [4] proposed the model, now widely
used, of the diffusion tensor featuring an analytic means to precisely describe
the three-dimensional nature of anisotropy in tissues.

Numerous works have already addressed the problem of the estimation and
regularization of these tensor fields. References can be found in [5], [6], [7], [8], [9].
Motivated by the potentially dramatic improvements that knowledge of anatomical
connectivity would bring to the understanding of functional coupling between
cortical regions [10], the study of neurodegenerative diseases, neurosurgery
planning or tumor growth quantification, various methods have been proposed
to tackle the issue of cerebral connectivity mapping. Local approaches based
on line propagation techniques [11], [12] provide fast algorithms and have been
augmented to incorporate some natural constraints such as regularity, stochastic
behavior and even local non-Gaussianity ([13], [14], [15], [16], [17], [18], [19],
[20]). All these efforts aim to overcome the intrinsic ambiguity of the diffusion
tensor related to white matter partial volume effects. Bearing in mind this
limitation, they enable us to generate relatively accurate models of the macroscopic
three-dimensional architecture of the human brain. The tensor indeed encapsulates
the averaged diffusion properties of water molecules inside a voxel whose
typical extents vary from 1 to 3 mm. At this resolution, the contribution to
the measured anisotropy of a voxel is very likely to come from different fiber
bundles presenting different orientations. This voxel-wise homogeneous Gaussian
model thus limits our capacity to resolve multiple fiber orientations, since local
tractography becomes unstable when crossing artificially isotropic regions
characterized by a planar or spherical diffusion profile [8]. On the other hand, new
diffusion imaging methods have recently been introduced in an attempt to better
describe the complexity of water motion, but at the cost of increased acquisition
times. This is the case for high angular resolution diffusion-weighted imaging [21], [22], where
the variance of the signal could give important information on the multimodal
aspect of diffusion. Diffusion Spectrum Imaging [23], [24] provides, at each voxel,
an estimation of the probability density function of water molecules and has been
shown to be a particularly accurate means to access the whole complexity of the
diffusion process in biological tissues. In favor of these promising modalities,
parallel MRI [25] will reduce the acquisition time in the near future and thus permit
high resolution imaging.

More global algorithms such as [26] have been proposed to better handle the
situations of false planar or spherical tensors (with fiber crossings) and to propose
some sort of likelihood of connection. In [27], the authors make use of the
major eigenvector field and in [28] the full diffusion tensor provides the metric of
a Riemannian manifold, but this was not exploited to propose intrinsic schemes.
We derive a novel approach to white matter analysis, through the use of stochastic
processes and differential geometry, which yields physically motivated distance
maps in the brain, seen as a 3-manifold, and thus the ability to compute intrinsic
geodesics in the white matter. Our goal is to recast the challenging task of
connectivity mapping into the natural framework of Riemannian differential geometry.
Section 2 starts from the very definition of Brownian motion and shows
its link to the diffusion MRI signal for linear spaces in terms of its probability
density function. Generalization to manifolds involves the introduction of the
infinitesimal generator of the Brownian motion. We then solve, in Section 3, the
problem of computing the intrinsic distance function from a starting point $x_0$ in
the white matter understood as a manifold. The key idea is that the geometry
of the manifold $M$ has a deep impact on the behavior of Brownian motion. We
claim that the diffusion tensor can be used to infer geodesic paths on $M$ that
coincide with neural tracts since its inverse defines the metric of $M$. Practically,
this means that, given any subset of voxels in the white matter, we will
be able to compute the paths most likely followed by water molecules to reach $x_0$.
As opposed to many methods developed to perform tractography, we can now
exhibit a bundle of fibers starting from a single point $x_0$ and reaching potentially
large areas of the brain. Efficient numerical implementation is non-trivial
and is described in Section 4. Results, advantages and drawbacks of the method
are presented and discussed in Section 5. We conclude and present potential
extensions in Section 6.

2 From Molecular Diffusion to Anatomical Connectivity

2.1 The Diffusion MRI Signal 

Diffusion MRI provides the only non-invasive means to characterize molecular
displacements, hence its success in physics and chemistry. To measure diffusion
in several directions, the Stejskal-Tanner imaging sequence is widely used. It
basically relies on two strong gradient pulses positioned before and after the
refocusing 180 degree pulse of a classical spin echo sequence to control the
diffusion weighting. For each slice, at least 6 independent gradient directions
and 1 unweighted image are acquired to be able to estimate the diffusion tensor
$D$ and probe potential changes of location of water molecules due to Brownian
motion. By performing one measurement without diffusion weighting ($S_0$) and
one ($S$) with a sensitizing gradient $g$, the diffusion coefficient $D$ along $g$ can be
estimated through the relation:

$$S = S_0 \exp\left(-\gamma^2 \delta^2 (\Delta - \delta/3) |g|^2 D\right) \qquad (1)$$

where $\delta$ is the duration of the gradient pulses, $\Delta$ the time between two gradient
pulses and $\gamma$ the gyromagnetic ratio of the hydrogen proton.
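
To make the signal relation concrete, here is a small Python sketch that inverts it for the diffusion coefficient along one gradient direction. The function names and the numerical value used for the proton gyromagnetic ratio (approximately 2.675e8 rad/s/T) are introduced here for illustration; units must be kept consistent by the caller (SI units are assumed).

```python
import math

GAMMA_PROTON = 2.675e8  # rad s^-1 T^-1, approximate proton gyromagnetic ratio


def diffusion_weighting(delta, Delta, g_norm, gamma=GAMMA_PROTON):
    """Factor gamma^2 delta^2 (Delta - delta/3) |g|^2 multiplying D."""
    return gamma ** 2 * delta ** 2 * (Delta - delta / 3.0) * g_norm ** 2


def diffusion_coefficient(S, S0, delta, Delta, g_norm, gamma=GAMMA_PROTON):
    """Solve S = S0 exp(-b D) for D, with b the diffusion weighting factor."""
    b = diffusion_weighting(delta, Delta, g_norm, gamma)
    return -math.log(S / S0) / b
```

Repeating this estimate for at least six non-collinear gradient directions gives enough equations to fit the six independent entries of the symmetric tensor.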

2.2 Brownian Motion and Anisotropic Molecular Diffusion 

We recall the definition of a Brownian motion in Euclidean space, the simplest
Markov process whose stochastic behavior is entirely determined by its initial
distribution $\mu$ and its transition mechanism. Transitions are described by a
probability density function $p$ or an infinitesimal generator $\mathcal{L}$. In linear homogeneous
spaces, $p$ is easily derived as the minimal fundamental solution associated with
$\mathcal{L}$ (solution of equation 2). On manifolds, constructing this solution is a tough
task, but for our problem, we only need to characterize $\mathcal{L}$. Further details can
be found in [29]. We denote by $W^d = C([0, \infty[ \to \mathbb{R}^d)$ the set of $d$-dimensional
continuous functions and by $\mathcal{B}(W^d)$ the topological $\sigma$-algebra on $W^d$. Then,

Definition 1. A $d$-dimensional continuous process $X$ is a $W^d$-valued random
variable on a probability space $(\Omega, \mathcal{F}, P)$.

By introducing the time $t \in [0, \infty[$ such that $\forall w \in W^d$, $w(t) \in \mathbb{R}^d$, a time-indexed
collection $\{X_t(\omega)\}$, $\forall \omega \in \Omega$, generates a $d$-dimensional continuous process if $X_t$
is continuous with probability one. A Brownian motion is characterized by:

Definition 2. With $\mu$ a probability on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, and $X_{t_0}, X_{t_1} - X_{t_0}, \ldots, X_{t_m} - X_{t_{m-1}}$
mutually independent with initial distribution specified by $\mu$ and Gaussian
distribution for subsequent times ($t_i$ are nonnegative and increasing), a process
$X_t$ is called a $d$-dimensional Brownian motion with initial distribution $\mu$.

With $X_t$ describing the position of water molecules, we now would like to
understand how the diffusion behavior of these molecules is related to the underlying
molecular hydrodynamics. The diffusion tensor, like the thermal or electrical conductivity
tensors, belongs to the broader class of general effective property tensors and
is defined as the proportionality term between an averaged generalized intensity
$B$ and an averaged generalized flux $F$. In our particular case of interest,
$B$ is the concentration gradient $\nabla C$ and $F$ is the mass flux $J$ such that Fick's
law holds: $J = -D \nabla C$. By considering the conservation of mass, the general
diffusion equation is readily obtained:

$$\frac{\partial C}{\partial t} = \nabla \cdot (D \nabla C) = \mathcal{L} C \qquad (2)$$

In anisotropic cerebral tissues, the motion of water molecules varies in direction
depending on obstacles such as axonal membranes. The positive definite order-2
tensor $D$ has been related [30] to the root mean square of the diffusion distance
by $D = \frac{1}{2t}\langle (x - x_0)(x - x_0)^T \rangle$ ($\langle \cdot \rangle$ denotes an ensemble average). This is directly
related to the minimal fundamental solution of equation 2 for an unbounded
anisotropic homogeneous medium and the regular Laplacian with initial
distribution (obeying the same law as concentration) $\lim_{t \to 0} p(x|x_0, t) = \delta(x - x_0)$:



$$p(x|x_0, t) = \frac{1}{\sqrt{(4\pi t)^d |D|}} \exp\left(-\frac{(x - x_0)^T D^{-1} (x - x_0)}{4t}\right)$$



Also known as the propagator, it describes the conditional probability of finding
a molecule, initially at position $x_0$, at $x$ after a time interval $t$. All the above
concepts find their counterparts when moving from linear spaces, such as $\mathbb{R}^d$,
to Riemannian manifolds. Explicit derivation of $p$ is non-trivial in that case and
the Laplace-Beltrami operator, well known in image analysis [31], will be of
particular importance to define $\mathcal{L}$.
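
The Gaussian propagator of a homogeneous anisotropic medium can be explored numerically. The sketch below evaluates the density, draws Brownian displacements from it, and recovers the tensor from their second moments, using the fact that a Gaussian with exponent $(x - x_0)^T D^{-1} (x - x_0) / 4t$ has covariance $2tD$. Function names, the random seed and the example tensor are assumptions chosen for illustration.

```python
import numpy as np


def propagator(x, x0, D, t):
    """Gaussian propagator for a constant diffusion tensor D after time t."""
    D = np.asarray(D, dtype=float)
    d = D.shape[0]
    r = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    norm = np.sqrt((4.0 * np.pi * t) ** d * np.linalg.det(D))
    return np.exp(-r @ np.linalg.solve(D, r) / (4.0 * t)) / norm


def simulate_displacements(D, t, n, seed=0):
    """Draw n displacements x - x0 from the propagator (covariance 2 t D)."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    return rng.multivariate_normal(np.zeros(D.shape[0]), 2.0 * t * D, size=n)


def estimate_tensor(displacements, t):
    """Recover D from the ensemble average of (x - x0)(x - x0)^T."""
    r = np.asarray(displacements, dtype=float)
    return (r.T @ r) / (len(r) * 2.0 * t)
```

With an elongated tensor such as `np.diag([1.7e-3, 0.3e-3, 0.3e-3])`, the simulated cloud of displacements is stretched along the first axis, which is the behavior expected inside a coherent fiber bundle.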
3 White Matter as a Riemannian Manifold



3.1 Geometry of a Manifold from Diffusion Processes 



We now want to characterize the anisotropic diffusion of water molecules in
the white matter exclusively in terms of an appropriate infinitesimal generator
$\mathcal{L}$. Brownian motions are characterized by their Markovian property and the
continuity of their trajectories. They have been, so far, generated from their
initial distribution $\mu$ and their transition density function $p$, but they can also be
characterized in terms of $\mathcal{L}$-diffusion processes. Without any further detail, we claim
that under some technical hypotheses on $\mathcal{L}$ (with its domain of definition $D(\mathcal{L})$)
and on the Brownian motion, it is possible to define an $\mathcal{L}$-diffusion process
on a Riemannian manifold $M$ from the $d$-dimensional stochastic process $X_t$.
We refer the interested reader to [29]. We focus, as in [32], on the case of a
diffusion process with time-independent infinitesimal generator $\mathcal{L}$, assumed to
be smooth and non-degenerate elliptic. We introduce $\Delta_M$, the Laplace-Beltrami
differential operator such that, for a function $f$ on a Riemannian manifold $M$,
$\Delta_M f = \mathrm{div}(\mathrm{grad} f)$. In local coordinates $x_1, x_2, \ldots, x_d$, the Riemannian metric
is written in the form $ds^2 = g_{ij} dx_i dx_j$ and the Laplace-Beltrami operator becomes

$$\Delta_M f(x) = \frac{1}{\sqrt{G}} \frac{\partial}{\partial x_j}\left(\sqrt{G}\, g^{ij} \frac{\partial f}{\partial x_i}\right) = g^{ij}(x) \frac{\partial^2 f}{\partial x_i \partial x_j}(x) + b^i(x) \frac{\partial f}{\partial x_i}(x)$$

where $G$ is the determinant of the matrix $\{g_{ij}\}$ and $\{g^{ij}\}$ its inverse. Moreover,

$$b^i = \frac{1}{\sqrt{G}} \frac{\partial(\sqrt{G}\, g^{ij})}{\partial x_j} = -g^{jk}\Gamma^i_{jk}$$

where $\Gamma^i_{jk}$ are the Christoffel symbols of the metric $\{g_{ij}\}$. $\Delta_M$ is second order,
strictly elliptic. At this point of our analysis, it turns out that constructing
the infinitesimal generator $\mathcal{L}$ of our diffusion process boils down to the following (see [33]):






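Definition 3. The operator $\mathcal{L}$ is said to be an intrinsic Laplacian generating a
Brownian motion on $M$ if $\mathcal{L} = \frac{1}{2}\Delta_M$.

Thus, for a smooth and non-degenerate elliptic differential operator on $M$ of the
form $\mathcal{L} = \frac{1}{2} d^{ij}(x) \frac{\partial^2}{\partial x_i \partial x_j} + b^i(x) \frac{\partial}{\partial x_i}$, we have the

Lemma 1. If $(d_{ij}(x))_{i,j=1,\ldots,d}$ denotes the inverse matrix of $(d^{ij}(x))_{i,j=1,\ldots,d}$, then
$g = d_{ij} dx_i dx_j$ defines a Riemannian metric $g$ on $M$.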

Conclusion: In the context of diffusion tensor imaging, this is of great impor- 
tance for the following since it means that the diffusion tensor D estimated at 
each voxel actually defines, after inversion, the metric of the manifold. We have 
made the link between the diffusion tensor data and the white matter manifold 
geometry through the properties of Brownian motion. 
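
A short sketch of this link between tensor and metric: the metric at a voxel is taken as the inverse of the diffusion tensor, and lengths of tangent vectors are then measured with it. The function names and the example tensor are illustrative assumptions.

```python
import numpy as np


def metric_from_tensor(D):
    """Metric tensor G at a voxel, obtained by inverting the diffusion tensor."""
    return np.linalg.inv(np.asarray(D, dtype=float))


def riemannian_norm(v, G):
    """Length of a tangent vector v under the metric G: sqrt(v^T G v)."""
    v = np.asarray(v, dtype=float)
    return float(np.sqrt(v @ G @ v))
```

For `D = np.diag([4.0, 1.0, 1.0])` a unit step along the first axis has length 0.5 while a unit step along the second has length 1, so directions of strong diffusion are intrinsically "shorter", which is why geodesics tend to follow fiber directions.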
3.2 From Radial Processes to Neural Fibers Recovery

We can now measure in the intrinsic space of the white matter. The fundamental
idea of what follows is the hypothesis that water molecules starting
at a given point $x_0$ on $M$, under Brownian motion, will potentially reach any
point on $M$ through a unique geodesic. The sole knowledge of the metric $g$ will
enable us to actually compute those geodesics on the manifold inferred from the
Laplace-Beltrami operator. Considering paths of Brownian motion (i.e. fibers
in the white matter) as the characteristic lines of the differential operator $\mathcal{L}$,
we can easily extend the concept of radial process to that type of stochastic
motion on a Riemannian manifold $M$ [34]. Let us fix a reference point $x_0 \in M$
and let $r(x) = \phi(x_0, x)$ be the Riemannian distance between $x$ and $x_0$. Then
we define the radial process $r_t = r(X_t)$. The function $r : M \to \mathbb{R}_+$ has a well
behaved singularity at the origin. We make the assumption that $M$ is geodesically
complete and recall the notion of exponential map, which will be crucial for the
numerical computation of neural fibers. We denote by $c_e$ the geodesic with initial
conditions $c_e(0) = x$ and $\dot{c}_e(0) = e$ ($e \in T_xM$). We denote by $E \subset TM$ the set of
vectors $e$ such that $c_e(1)$ is defined. It is an open subset of the tangent bundle
$TM$ containing the null vectors $0_x \in T_xM$.

Definition 4. The exponential map $\exp : E \subset TM \to M$ is defined by $\exp(e) =
c_e(1)$. We denote by $\exp_x$ its restriction to one tangent space $T_xM$.

Hence, in particular, for each unit vector $e \in T_{x_0}M$, there is a unique geodesic
$c_e : [0, \infty[ \to M$ such that $c_e(0) = x_0$ and $\dot{c}_e(0) = e$, and the exponential map gives
$c_e(t) = \exp_{x_0}(te)$. For small time steps $t$, the geodesic $c_e|_{[0,t]}$ is the unique distance
minimizing geodesic between its endpoints. We need one more notion to conclude
this section: the cut locus of $x_0$, $\mathrm{Cut}_{x_0}$, which will help us to characterize the
distance function $r$. It is nothing but the locus of points where the geodesics
starting orthonormally from $x_0$ stop being optimal for the distance. The radial
function $r(x) = \phi(x_0, x)$ is smooth on $M \setminus \mathrm{Cut}_{x_0}$ and we have $|\mathrm{grad}\,\phi(x)| = 1$.

Conclusion: We have expressed the distance function on $M$. The objectives
of the following section will be to propose accurate algorithms to compute this
function $\phi$ everywhere on $M$ and then to use it to estimate geodesics (Brownian
paths) on this manifold (the brain white matter).

4 Intrinsic Distance Function, Geodesics 

4.1 A Level Set Formulation for the Intrinsic Distance Function 

We are now concerned with the effective computation of the distance function
$\phi$ from a closed, non-empty subset $K$ of the 3-dimensional, smooth, connected
and complete Riemannian manifold $(M, g)$. In the remainder, $K$ will actually
be restricted to the single point $x_0$, origin of a Brownian motion. We will
nevertheless formulate everything in terms of $K$ since considering the distance to a
larger subset of $M$ will be of interest for future work. Let us now further discuss
the notion of distance function on a Riemannian manifold. Given two points
$x, y \in M$, we consider all the piecewise differentiable curves joining $x$ to $y$. Since
$M$ is connected, by the Hopf-Rinow theorem, such curves do exist and

Definition 5. The distance $\phi(x, y)$ is defined as the infimum of the lengths of
the curves starting at $x$ and ending at $y$.

Corollary 1. If $x_0 \in M$, the function $r : M \to \mathbb{R}$ given by $r(x) = \phi(x, x_0)$ is
continuous on $M$ but in general it is not everywhere differentiable.

We consider a general Hamilton-Jacobi partial differential equation with Dirichlet
boundary conditions

$$\begin{cases} H(x, D\phi(x)) = 0 & \text{in } M \setminus K \\ \phi(x) = \phi_0(x) & \text{when } x \in K \end{cases}$$

where $\phi_0$ is a continuous real function on $K$ and the Hamiltonian $H : M \times
T^*M \to \mathbb{R}$ is a continuous real function on the cotangent bundle. We make the
assumption that $H(x, \cdot)$ is convex and we set $\phi_0(x) = 0\ \forall x \in K$.

We denote by $|v|$ the magnitude of a vector $v$ of $TM$, defined as $\sqrt{g(v, v)}$. In
matrix notation, by forming $G = \{g_{ij}\}$ the metric tensor, this is written $\sqrt{v^T G v}$.
Then, by setting $H(x, p) = |p| - 1$, we will work on the following theorem (for
details on viscosity solutions on a Riemannian manifold, we refer to [35])

Theorem 1. The distance function $\phi$ is the unique viscosity solution of the
Hamilton-Jacobi problem

$$\begin{cases} |\mathrm{grad}\,\phi| = 1 & \text{in } M \setminus K \\ \phi(x) = 0 & \text{when } x \in K \end{cases} \qquad (3)$$

in the class of bounded uniformly continuous functions.

This is the well-known eikonal equation on the Riemannian manifold $(M, g)$.
The viscosity solution $\phi$ at $x \in M$ is the minimum time $t \geq 0$ for any curve $\gamma$ to
reach a point $\gamma(t) \in K$ starting at $x$ with the conditions $\gamma(0) = x$ and $\left|\frac{d\gamma}{dt}\right| \leq 1$.
$\phi$ is the value function of the minimum arrival time problem. This will enable
us to solve equation 3 as a dynamic problem and thus to take advantage of the
great flexibility of level set methods. On the basis of [36], [37], [38] and [39], we
reformulate equation 3 by considering $\phi$ as the zero level set of a function $\psi$ and
requiring that the evolution of $\psi$ generates $\phi$ so that

$$\psi(x, t) = 0 \Leftrightarrow t = \phi(x) \qquad (4)$$

Osher [36] showed, by using Theorem 5.2 from [39], that, under the hypothesis
that the Hamiltonian $H$ is independent of $\phi$, the level set generated by 4 is a
viscosity solution of 3 if $\psi$ is the viscosity solution of

$$\begin{cases} \psi_t + F(t, x, D\psi(t, x)) = 0 & \forall t > 0 \\ \psi(x, 0) = \psi_0(x) \end{cases} \qquad (5)$$
provided that $F > 0$ and does not change sign. This is typically the case for our
anisotropic eikonal equation, where the anisotropy directly arises from the manifold
topology and not from the classical speed function of initial value problems
(which equals 1 everywhere here). To find our solution, all we need to do is thus
to evolve $\psi(x, t)$ while tracking, for all $x$, the time $t$ when it changes sign. Now
we have to solve 5 with

$$F(t, x, D\psi) = H(t, x, D\psi) + 1 = |\mathrm{grad}\,\psi|$$

We first recall that for any function $f \in \mathcal{F}$, where $\mathcal{F}$ denotes the ring of smooth
functions on $M$, the metric tensor $G$ and its inverse define isomorphisms between
vectors (in $TM$) and 1-forms (in $T^*M$). In particular, the gradient operator is
defined as $\mathrm{grad} f = G^{-1} df$, where $df$ denotes the first-order differential of $f$. It
directly follows that

$$|\mathrm{grad}\,\psi| = \sqrt{g(\mathrm{grad}\,\psi, \mathrm{grad}\,\psi)} = \sqrt{d\psi^T G^{-1} d\psi}$$

and we now present the numerical schemes used to estimate geodesics as well as
the viscosity solution of

$$\psi_t + |\mathrm{grad}\,\psi| = 0 \qquad (6)$$

4.2 Numerical Scheme for the Distance Function 

Numerical approximation of the hyperbolic term in 6 is now carefully reviewed
on the well-known basis of available schemes for hyperbolic conservation laws.
We seek a three-dimensional numerical flux that approximates the continuous flux
$|\mathrm{grad}\,\psi|^2$ and that is consistent and monotone, so that it satisfies the usual
jump and entropy conditions and converges towards the unique viscosity solution
of interest. References can be found in [40]. On the basis of the Engquist-Osher
flux [37] and the approach by Kimmel-Amir-Bruckstein for level set distance
computation on 2D manifolds [41], we propose the following numerical flux for
our quadratic Hamiltonian $d\psi^T G^{-1} d\psi$:

$$|\mathrm{grad}\,\psi|^2 = \sum_{i=1}^{3} g^{ii} \left( \max(D^-_{x_i}\psi, 0)^2 + \min(D^+_{x_i}\psi, 0)^2 \right) + \sum_{\substack{i,j=1 \\ i \neq j}}^{3} g^{ij}\, \mathrm{minmod}(D^+_{x_i}\psi, D^-_{x_i}\psi)\, \mathrm{minmod}(D^+_{x_j}\psi, D^-_{x_j}\psi)$$

where the $D^{\pm}_{x_i}\psi$ are the forward/backward approximations of the gradient in
$x_i$. A higher-order implementation has also been done using WENO schemes in
order to increase the accuracy of the method. They consist of a convex
combination of $n$-th order (we take $n = 5$) polynomial approximations of the
derivatives [42]. A classical narrow band implementation is used to speed up the
computations.
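
A flux of this kind (upwinded diagonal terms, minmod-limited cross terms) can be sketched as follows. This is an illustrative sketch only: it assumes a spatially constant inverse metric `g_inv` (in the paper the metric varies per voxel), periodic wrap-around at the grid border, and the hypothetical function names `minmod` and `upwind_grad_norm_sq`.

```python
import numpy as np

def minmod(a, b):
    """Return the argument of smaller magnitude if signs agree, else 0."""
    return 0.5 * (np.sign(a) + np.sign(b)) * np.minimum(np.abs(a), np.abs(b))

def upwind_grad_norm_sq(psi, g_inv, h=1.0):
    """Approximate d(psi)^T G^{-1} d(psi) on a 3D grid.

    psi   : 3D array of samples of the level set function
    g_inv : constant 3x3 inverse metric (assumption made for this sketch)
    h     : grid spacing
    """
    d_plus = [(np.roll(psi, -1, axis=a) - psi) / h for a in range(3)]
    d_minus = [(psi - np.roll(psi, 1, axis=a)) / h for a in range(3)]
    flux = np.zeros(psi.shape, dtype=float)
    for i in range(3):
        # Diagonal terms: upwind differences
        flux += g_inv[i, i] * (np.maximum(d_minus[i], 0.0) ** 2
                               + np.minimum(d_plus[i], 0.0) ** 2)
        for j in range(3):
            if i != j:
                # Cross terms: minmod-limited differences
                flux += (g_inv[i, j]
                         * minmod(d_plus[i], d_minus[i])
                         * minmod(d_plus[j], d_minus[j]))
    return flux

psi = np.fromfunction(lambda i, j, k: i + j, (8, 8, 8))
g_inv = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(upwind_grad_norm_sq(psi, g_inv)[3, 3, 3])
```

For the linear test function above, $d\psi = (1, 1, 0)$, so the interior value agrees with the exact quadratic form $d\psi^T G^{-1} d\psi = 3$.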

4.3 Numerical Scheme for the Geodesics Estimation

We finally derive an intrinsic method for geodesics computation in order to
estimate paths of diffusion on $M$, eventually corresponding to neural fiber tracts.
Geodesics are indeed the integral curves of the intrinsic distance function and are
classically obtained by back-propagating in its gradient directions from a given
point $x$ to the source $x_0$. Our problem of interest consists of starting from a
given voxel of the white matter and computing the optimal pathway in terms of
the distance $\phi$ until $x_0$ is reached. We propose to take into account the geometry
of the manifold during this integration step by making use of the exponential
map. The geodesic $c(s)$ is the parameterized path $c(s) = (c_1(s), ..., c_d(s))$ which
satisfies the differential equation

$$\frac{d^2 c_i}{ds^2} + \Gamma^i_{jk} \frac{dc_j}{ds} \frac{dc_k}{ds} = 0 \tag{7}$$

where the $\Gamma^i_{jk}$ are the Christoffel symbols of the second kind, defined as
$\Gamma^i_{jk} = \frac{1}{2} g^{il} (\partial g_{kl}/\partial x_j + \partial g_{jl}/\partial x_k - \partial g_{jk}/\partial x_l)$. Equation 7 allows us to write exp
in local coordinates around a point $x \in M$ as

$$c_i(\exp_x(X)) = x_i + X_i - \frac{1}{2} \Gamma^i_{jk} X_j X_k + O(|X|^3)$$

where $X$ will be identified with the gradient of the distance function at $x$ and
derivatives of the metric are estimated by appropriate finite difference schemes.
This leads to a much more consistent integration scheme on $M$.
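
One integration step of this kind can be sketched as below, with the Christoffel symbols estimated by central finite differences of the metric. This is a sketch under assumptions, not the authors' code: `christoffel`, `exp_step` and `polar_metric` are hypothetical names, the metric is supplied as a callable returning a dense matrix, and the second-order expansion of the exponential map is the standard textbook one.

```python
import numpy as np

def christoffel(metric, x, h=1e-5):
    """Christoffel symbols gamma[i, j, k] of the second kind at point x."""
    x = np.asarray(x, dtype=float)
    d = x.size
    g_inv = np.linalg.inv(metric(x))
    dg = np.zeros((d, d, d))                 # dg[l, j, k] = d g_jk / d x_l
    for l in range(d):
        step = np.zeros(d)
        step[l] = h
        dg[l] = (metric(x + step) - metric(x - step)) / (2.0 * h)
    gamma = np.zeros((d, d, d))
    for i in range(d):
        for j in range(d):
            for k in range(d):
                gamma[i, j, k] = 0.5 * sum(
                    g_inv[i, l] * (dg[j, k, l] + dg[k, j, l] - dg[l, j, k])
                    for l in range(d))
    return gamma

def exp_step(metric, x, X):
    """Second-order approximation of exp_x(X) in local coordinates."""
    x = np.asarray(x, dtype=float)
    X = np.asarray(X, dtype=float)
    gamma = christoffel(metric, x)
    return x + X - 0.5 * np.einsum("ijk,j,k->i", gamma, X, X)

def polar_metric(x):
    """Flat plane in polar coordinates (r, theta): g = diag(1, r^2)."""
    return np.diag([1.0, x[0] ** 2])

gamma = christoffel(polar_metric, [2.0, 0.3])
print(gamma[0, 1, 1], gamma[1, 0, 1])
```

For the polar metric the known closed forms are $\Gamma^r_{\theta\theta} = -r$ and $\Gamma^\theta_{r\theta} = 1/r$, which gives a quick sanity check of the finite-difference estimate at $r = 2$.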

5 Evaluation on Synthetic and Real Datasets 

We have experimented with local line propagation methods, which only produce
macroscopically satisfying results. With trilinear interpolation of the tensor field
and a 4th-order Runge-Kutta integration scheme, we used the advection-diffusion
method [13] and obtained the results of Figure 1. Our global approach is instead
aimed at resolving the local ambiguities due to isotropic tensors. We consider
synthetic and real data¹ to quantify the quality of the estimated distance
functions with upwind and WENO5 finite difference schemes. Our criterion is the
a posteriori evaluated map $|\mathrm{grad}\,\phi|$, which must be equal to 1 everywhere
except at the origin $x_0$. As shown on Figure 2 [left], the synthetic data corresponds
to an anisotropic non-homogeneous medium for which the diffusion paths
describe three (independently homogeneous) intersecting cylinders oriented along
the x, y and z axes. This results in perfectly isotropic tensors at the intersection
of the three cylinders, surrounded by planar tensors in the area where only two
cylinders cross each other. Though simple, it is a typical configuration where local
methods become unreliable. $x_0$ denotes the origin of the distance function whose



¹ The authors would like to thank J.F. Mangin and J.B. Poline, CEA-SHFJ/Orsay,
France, for providing us with the data.

Fig. 1. Neural tracts estimated by the advection-diffusion based propagation method




Fig. 2. [left]: Synthetic tensor field (partial), [center]: Associated distance function 
[right]: Real diffusion tensor MRI (RGB mapping of the major eigenvector) 

Table 1. Statistics on $|\mathrm{grad}\,\phi|$ for synthetic and real diffusion tensor MRI data

DataSet     Scheme   Mean       Std. Dev   Maximum
Synthetic   Upwind   0.9854     0.123657   4.50625
Synthetic   WENO5    0.977078   0.116855   2.0871
DT-MRI      Upwind   0.994332   0.116326   4.80079
DT-MRI      WENO5    0.973351   0.110364   3.72567


estimation with the level set scheme proposed in the previous section exhibits
very good results in Table 1, with a noticeable improvement when using WENO5
schemes. The solution of equation 6 along the axis associated to the cylinder
containing $x_0$ is presented on Figure 2 [center]. The recovery of the underlying
pathways reaching $x_0$ by our intrinsic method turns out to be fast and accurate
in practice. Figure 3 [left] shows the computed geodesics linking $x_0$ to anisotropic
voxels located at the extremity of a different cylinder. This is basically what
happens in the brain white matter when multiple fiber bundles pass through a
single voxel. Our global approach seems particularly adequate to disambiguate
the problem of fiber tract crossings by minimizing the geodesic distance in the
white matter.

Fig. 3. Inferred geodesics by intrinsic integration - [left]: synthetic [right]: real data

Real diffusion data on Figure 2 [right] is used to focus on the posterior part
of the corpus callosum. Estimation of the distance function with upwind and
WENO5 schemes again produces very good results, with an evident advantage in
terms of robustness for the WENO implementation. We must note here that our
numerical flux tends to be a bit diffusive, resulting in smooth distance functions.
This may be a problem if the original data itself does not have good contrast,
since this could yield geodesics with very low curvature. Exponential map
based integration produces the result of Figure 3 [right] when starting from the
extremities of the major forceps. We have noticed that our method is not
influenced by locally spherical or planar tensors, since the estimated fibers are not
affected by the presence of lower anisotropy regions (in red) that coincide with
crossing areas. This global approach thus brings coherence into diffusion tensor
data and naturally handles the issues affecting local tractography methods, like
inconsistent tracking in locally isotropic areas.

6 Conclusion 

Diffusion imaging is a truly quantitative method which gives direct insight into
the physical properties of tissues through the observation of random molecular
motion. However, correct interpretation of diffusion data and inference of accurate
information is a very challenging project. Our guideline has been to always bear
in mind that the true and unique phenomenon that diffusion imaging records
is Brownian motion. Taking that stochastic process as our starting point, we
have proposed a novel global approach to white matter connectivity mapping. It
relies on the fact that probing and measuring a diffusion process on a manifold
$M$ provides enough information to infer the geometry of $M$ and compute its
geodesics, corresponding to diffusion pathways. Clinical validation is obviously
needed, but already we can think of extensions of this method: intrinsic geodesics
regularization under the action of the scalar curvature of $M$, and geodesics
classification to recover complete tracts. Estimation of geodesics deviation could be
used to detect merging or fanning fiber bundles.

Abstract. This article proposes a solution of the Lambertian Shape
From Shading (SFS) problem by designing a new mathematical framework
based on the notion of viscosity solutions. The power of our approach
is twofold: 1) it defines a notion of weak solutions (in the
viscosity sense) which does not necessarily require boundary data. Note
that, in the previous SFS work of Rouy et al. [23,15], Falcone et al. [8],
Prados et al. [22,20], the characterization of a viscosity solution and its
computation require the knowledge of its values on the boundary of the
image. This was quite unrealistic because in practice such values are
not known. 2) it unifies the work of Rouy et al. [23,15], Falcone et al.
[8], Prados et al. [22,20], based on the notion of viscosity solutions, and
the work of Dupuis and Oliensis [6] dealing with classical ($C^1$) solutions.
Also, we generalize their work to the “perspective SFS” problem recently
introduced by Prados and Faugeras [20].

Moreover, this article introduces a “generic” formulation of the SFS problem.
This “generic” formulation summarizes various (classical) formulations
of the Lambertian SFS problem. In particular it unifies the orthographic
and the perspective SFS problems. This “generic” formulation
significantly simplifies the formalism of the problem. Thanks to this generic
formulation, a single algorithm can be used to compute numerical
solutions of all these previous SFS formulations.

Finally we propose two algorithms which provide numerical approxima- 
tions of the new weak solutions of the “generic SFS” problem. These 
provably convergent algorithms are quite robust and do not necessarily 
require boundary data. 



1 Introduction 

The application of the theory of Partial Differential Equations (PDEs) to the
Shape from Shading (SFS) problem has been hampered by several types of
difficulties. The first type arises from the kind of modelling that is used: orthographic
cameras looking at Lambertian objects with a single point light source at
infinity is the usual set of assumptions [29,10]. The second type is mathematical:
characterizing the solution(s) of the corresponding PDE has turned out to be
a very difficult problem; boundary conditions are assumed to be known, say at
the image boundary, in contradiction with real practice [23,22,8]. The third type is
algorithmic: assuming that existence has been proved, coming up with provably
convergent numerical schemes has turned out to be quite involved [7].

Our approach is therefore based upon the interaction of the following three 
areas: 

1. Mathematics: We use and “extend” the notion of viscosity solutions to solve 
such basic problems as the existence and uniqueness of a solution or the 
characterization of all solutions when uniqueness does not hold. 

2. Algorithmic: In [2], Barles and Souganidis propose a large class of
approximation schemes (called monotone) for these solutions. Inspired by their
work, we build such schemes for the SFS equations, from which we obtain
algorithms whose properties we can analyze in detail (stability, convergence,
accuracy). This results in provably correct code within a set of well-defined
assumptions.

3. Modeling: The classical theory of viscosity solutions (used until now for 
solving the SFS problem [23,15,22,20,8]) is not well-adapted to the natural 
constraints of the SFS problem. In particular it requires that boundary con- 
ditions be given, e.g. at the image boundary, and creates undesirable folds 
(see section 3). In order to be able to get rid of this constraint, we have 
adapted the notion of viscosity solutions. 

Our contributions are first in the area of Mathematics: we adapt the notion of
singular viscosity solutions (recently developed by Camilli and Siconolfi [3,4])
to obtain a “new” class of viscosity solutions which is much more suitable to
the SFS problem than the previous ones. This mathematical framework is very
general and allows us to improve and unify the work of [23,15,6,22,20,8]. Directly
connected to the area of modeling, thanks to the introduction of this framework,
we are able to relax the very constraining assumption that boundary conditions
are known. Concerning the area of modeling, we extend the work of [20]:
considering a pinhole camera, we allow the light source to be either at infinity or
approximately at the optical center, as in the case of a flash. We also show that
the orthographic and pinhole camera SFS equations are special cases of a general
equation, thereby simplifying the formalization of the problem. Our contributions
are also algorithmic: we propose two provably convergent approximation
schemes for our “generic” SFS equation. Moreover, one of the algorithms we
propose seems to be the most efficient iterative algorithm of the SFS literature.
The article is written in a non-mathematical style. The reader interested in the
proofs is referred to [19,21].

2 A Unification of the “Perspective” and “Orthographic 
SFS” 

We deal with Lambertian scenes and suppose that the albedo is constant and
equal to 1. The scene is represented by a surface $S$. Let $\Omega$, the image, be the
rectangular domain $]0, A[ \times ]0, M[$. $S$ can be explicitly parameterized by using
a function defined on the closure $\overline{\Omega}$. The particular type of parametrization is
irrelevant here but may vary according to the camera type (orthographic versus
pinhole) and the position of the light source (finite or infinite distance). We denote
by $I$ the image intensity, a function from $\overline{\Omega}$ into the closed interval [0, 1]. The
Lambertian hypothesis implies:

$$I(x) = \frac{n(x) \cdot L}{|n(x)|} \tag{1}$$

where $n(x)$ is a normal vector to the surface $S$ at the point $S(x)$ and $L$ is the
unit vector representing the light direction at this same point (the light source
is assumed to be a point). Despite the notation, $L$ can depend on $S(x)$ if the
point source is at a finite distance from the scene.



2.1 “Orthographic SFS” with a Point Light Source at Infinity 

This is the traditional setup for the SFS problem. We denote by $L = (\alpha, \beta, \gamma)$
the unit vector representing the direction of the light source ($\gamma > 0$), $\mathbf{l} = (\alpha, \beta)$,
and $u$ the distance of the points in the scene to the camera. The SFS problem
is then, given $I$ and $L$, to find a function $u : \overline{\Omega} \to \mathbb{R}$ satisfying the brightness
equation:

$$\forall x \in \Omega, \quad I(x) = \frac{-\nabla u(x) \cdot \mathbf{l} + \gamma}{\sqrt{1 + |\nabla u(x)|^2}}$$

In the SFS literature, this equation is rewritten in a variety of ways as $H(x, p) =
0$, where $p = \nabla u$:

1) In [23], Rouy and Tourin introduce $H_{R/T}(x, p) = I(x)\sqrt{1 + |p|^2} + p \cdot \mathbf{l} - \gamma$.

2) In [6], Dupuis and Oliensis consider

$$H_{D/O}(x, p) = I(x)\sqrt{1 + |p|^2 - 2\, p \cdot \mathbf{l}} + p \cdot \mathbf{l} - 1$$

(use the change of variables: $\Psi(x_1, x_2, z) = (x_1, x_2, x_1\alpha + x_2\beta + z\gamma)$)

3) In the case where $L = (0, 0, 1)$, Lions et al. [15] deal with:

$$H_{Eiko}(x, p) = |p| - \sqrt{\frac{1}{I(x)^2} - 1}$$ (called the Eikonal equation)

The function $H$ is called the Hamiltonian.
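
The relation between the brightness equation and two of these Hamiltonians can be checked numerically. The snippet below is a small illustrative sketch (the function names `brightness`, `h_rouy_tourin` and `h_eikonal` are hypothetical): it renders the intensity produced by a given gradient $p$ and verifies that $H_{R/T}$, and $H_{Eiko}$ for a vertical light, vanish at that $p$.

```python
import numpy as np

def brightness(p, light):
    """Orthographic Lambertian intensity for surface gradient p and unit light."""
    p = np.asarray(p, dtype=float)
    l, gamma = np.asarray(light[:2], dtype=float), float(light[2])
    return (-p @ l + gamma) / np.sqrt(1.0 + p @ p)

def h_rouy_tourin(intensity, p, light):
    """H_{R/T}(x, p) = I sqrt(1 + |p|^2) + p . l - gamma."""
    p = np.asarray(p, dtype=float)
    l, gamma = np.asarray(light[:2], dtype=float), float(light[2])
    return intensity * np.sqrt(1.0 + p @ p) + p @ l - gamma

def h_eikonal(intensity, p):
    """H_{Eiko}(x, p) = |p| - sqrt(1 / I^2 - 1), valid for L = (0, 0, 1)."""
    p = np.asarray(p, dtype=float)
    return np.linalg.norm(p) - np.sqrt(1.0 / intensity ** 2 - 1.0)

p = np.array([0.3, -0.4])
oblique = np.array([0.2, 0.1, np.sqrt(1.0 - 0.2 ** 2 - 0.1 ** 2)])
vertical = np.array([0.0, 0.0, 1.0])

print(h_rouy_tourin(brightness(p, oblique), p, oblique))   # close to 0
print(h_eikonal(brightness(p, vertical), p))               # close to 0
```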



2.2 “Perspective SFS” with a Point Light Source at Infinity 

Few SFS approaches deal with the perspective projection problem. To our
knowledge, only eight authors [17,13,9,27,28,20,26,5] consider a pinhole camera model
instead of an affine or orthographic model. Among these papers, only the work
of Prados and Faugeras [20] proposes a formalism completely based on Partial
Differential Equations (PDEs) and provides a rigorous mathematical study.
The camera is characterized and represented by the retinal plane $R$ and by the
optical center, as shown in figure 1. We denote by $f$ the focal length. We assume
that $S$ can be explicitly parameterized by the depth modulation function $u$ defined
on $\overline{\Omega}$:

$$S = \{u(x) \cdot (x, -f); \; x \in \overline{\Omega}\},$$

and that the surface is visible (in front of the retinal plane), hence $u > 1$.

We also denote by $L = (\alpha, \beta, \gamma)$ the unit vector representing the direction of the
light source ($\gamma > 0$). Combining the expression of $n(x)$ (easily obtained through
differential calculus) and the change of variables $v = \ln(u)$, Prados and Faugeras
[18,20] obtain from the irradiance equation the following Hamiltonian:



By using the change of variables v{x) = ^[7 f — I ■ x]m(x), we obtain another 
Hamiltonian Hpers(x,p) which verifies more interesting properties (see [19]). 



Fig. 1. Images arising from an orthogonal (versus perspective) projection. 



2.3 “Perspective SFS” with a Single Point Light Source Located at 
the Optical Center 

We present a new formulation of the “perspective SFS”. This approximately
models the situation encountered when we use a simple camera equipped with a
flash and the scene is relatively far from the camera. In this case, we represent
the scene by the surface $S$ defined by

$$S(x) = \frac{f\, u(x)}{\sqrt{|x|^2 + f^2}}\, (x, -f), \quad x \in \overline{\Omega}.$$

Using the same trick as in the previous section ($v = \ln(u)$), we readily obtain
the Hamiltonian:

$$H_F(x, p) = I(x)\sqrt{f^2 |p|^2 + (p \cdot x)^2 + Q(x)^2} - Q(x)$$

where $Q(x) = \sqrt{f^2 / (|x|^2 + f^2)}$. See [19] for more details.

2.4 A Generic Hamiltonian

In [19], we prove that all the previous SFS Hamiltonians are special cases of the
following “generic” Hamiltonian:

$$H_g(x, p) = \tilde{H}_g(x, A_x p + v_x) + w_x \cdot p + c_x,$$

with $\tilde{H}_g(x, q) = \kappa_x \sqrt{|q|^2 + K_x^2}$, where $\kappa_x, K_x \geq 0$, $A_x = D_x R_x$,
$D_x = \begin{pmatrix} \mu_x & 0 \\ 0 & \nu_x \end{pmatrix}$, $R_x$ is a rotation matrix if $x \neq 0$, $R_x = Id_2$ if $x = 0$,
$\mu_x, \nu_x \neq 0$ ($\mu_x, \nu_x \in \mathbb{R}$), $v_x, w_x \in \mathbb{R}^2$ and $c_x \in \mathbb{R}$.

By using the Legendre transform, we rewrite this Hamiltonian as a “generic”
Hamilton-Jacobi-Bellman (HJB) Hamiltonian:

$$H_g(x, p) = \sup_{a \in \overline{B}_2(0,1)} \{-f_g(x, a) \cdot p - l_g(x, a)\}.$$

In [19], we detail the exact expressions of $f_g$ and $l_g$. The HJB formulations of
the Hamiltonians $H_{Eiko}$, $H_{D/O}$, $H_{R/T}$ and $H_{P/F}$, respectively given in [23,6,22,
20], are special cases of the above generic formulation; thereby ours is a
generalization and a unification of these works. This generic formulation considerably
simplifies the formalism of the problem. All theorems about the characterization
and the approximation of the solution are now proved by using this generic SFS
Hamiltonian. In particular, this formulation unifies the orthographic and
perspective SFS problems. Also, from a practical point of view, a single code can
be used to numerically solve these two problems.
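
As an illustrative check (a sketch for the simplest case only; the expressions detailed in [19] may be written differently), the Eikonal Hamiltonian with $n(x) = \sqrt{1/I(x)^2 - 1}$ fits both forms. In the generic form it corresponds to the parameter choice $\kappa_x = 1$, $K_x = 0$, $A_x = Id_2$, $v_x = w_x = 0$, $c_x = -n(x)$. In the HJB form, one admissible choice is $f_g(x, a) = a$ and $l_g(x, a) = n(x)$, since the supremum of a linear form over the closed unit ball is the Euclidean norm:

```latex
\begin{align*}
H_{Eiko}(x, p) &= |p| - n(x)
  = 1 \cdot \sqrt{|Id_2\, p + 0|^2 + 0^2} + 0 \cdot p - n(x), \\
\sup_{a \in \overline{B}_2(0,1)} \{ -a \cdot p - n(x) \}
  &= \Big( \sup_{|a| \le 1} \, (-a \cdot p) \Big) - n(x)
  = |p| - n(x),
  \qquad \text{attained at } a = -\frac{p}{|p|} \ (p \neq 0).
\end{align*}
```

With this choice $l_g \geq 0$, and $l_g$ vanishes exactly where $I(x) = 1$.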

3 Weaknesses of the Previous Theoretical Approaches 

The notion of viscosity solutions was first used to solve SFS problems by Lions,
Rouy and Tourin [23,15] in the 90s. Their work was based upon the notion of
continuous viscosity solution. Viscosity solutions are PDE solutions in a
weak sense. In particular, they are not necessarily differentiable and can have
edges. This notion allows one to define a solution of a PDE which does not have
classical solutions. For example, the equation

$$|\nabla u(x)| = 1 \quad \text{for all } x \text{ in } ]0, 1[ \tag{2}$$

with $u(0) = u(1) = 0$ does not have classical solutions (Rolle's theorem) but
has a continuous viscosity solution (see figure 2-a)). Let us emphasize that
continuous viscosity solutions are continuous (on the closure of the set where they
are defined) and that a solution in the classical sense is a viscosity solution. The
weakness of this notion is due to the compatibility condition necessary for the
existence of the solution (a constraint on the variation of the boundary conditions
[14]). For instance, the same equation (2) with $u(0) = 0$, $u(1) = 1.5$ does not have
continuous viscosity solutions. Now let us suppose that we make a large error
on the boundary condition when we compute a numerical solution of the SFS
problems. If this error is too large, then there do not exist continuous viscosity
solutions. In this case one may wonder what the numerical algorithm of [23,15]
computes. In [22], Prados et al. answer this question by proposing to use the more
general idea of discontinuous viscosity solutions. For example, equation (2) with
$u(0) = 0$, $u(1) = 1.5$ has a discontinuous viscosity solution (see Figure 2-b)). Let
us emphasize that a “discontinuous viscosity solution” can have discontinuities
and that a continuous viscosity solution is a discontinuous viscosity solution.

Fig. 2. a) Continuous viscosity solution of (2) with $u(0) = u(1) = 0$; b) discontinuous
viscosity solution of (2) with $u(0) = 0$ and $u(1) = 1.5$.
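
The two boundary configurations discussed for equation (2) can be reproduced with the standard representation formula for the unit-speed eikonal equation, $u(x) = \min(u(0) + x,\; u(1) + 1 - x)$. The sketch below is illustrative only: `eikonal_1d` is a hypothetical helper, and the compatibility test it reports is the condition $|u(1) - u(0)| \leq 1$ specific to this one-dimensional example.

```python
import numpy as np

def eikonal_1d(phi0, phi1, n=101):
    """Value-function solution of |u'| = 1 on ]0, 1[ with data phi0, phi1.

    Returns the grid, the solution, and whether the boundary data are
    compatible (i.e. both boundary values are actually attained).
    """
    x = np.linspace(0.0, 1.0, n)
    u = np.minimum(phi0 + x, phi1 + (1.0 - x))
    compatible = abs(phi1 - phi0) <= 1.0
    return x, u, compatible

# Compatible data: continuous "tent" solution peaking at x = 0.5
x, u, ok = eikonal_1d(0.0, 0.0)
print(ok, u.max())

# Incompatible data: u(x) = x inside, so the value reached at x = 1 is 1.0,
# not the prescribed 1.5; the solution jumps at the boundary.
x, u, ok = eikonal_1d(0.0, 1.5)
print(ok, u[-1])
```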

The classical theory of viscosity solutions offers simple and general theorems
of existence and uniqueness of solutions for exactly the type of PDEs that arise
in the context of SFS. In particular the theory allows one to characterize exactly
all possible continuous viscosity solutions: given a particular Dirichlet condition
on the image boundary (verifying the compatibility condition), if the set of
critical points (points of maximal intensity, i.e. $I(x) = 1$) is empty, then there
exists a unique continuous viscosity solution satisfying the boundary conditions;
if the set of critical points is not empty, there exists an infinity of continuous
solutions which are characterized by their values at the critical points. Note
that this result is general and applies equally to all the SFS models described
in section 2 (see [19]). As a consequence, the SFS problem is ill-posed and, to
compute an approximation of a solution, Rouy et al. and Prados et al. [23,22,20]
must assume that the values of the solutions are given at the image boundary
and the critical points. This is quite unsatisfactory, even more so since small
errors on these values create undesirable crests; see figure 3-b) or [22] for an
example with a real image. Falcone [8] proposes not to specify the
values of the solution at the critical points anymore (he still requires the values
at the image boundary though). In order to achieve this, he uses the notion
of maximal viscosity solutions developed by Camilli and Siconolfi [3]. Despite
its advantages, this approach is not really adapted to the SFS problem; see for
example figure 3-c). In this figure, the maximal solution $u_{max}$ associated to the
image obtained from the original surface $u$ shows a highly visible crest where the
surface should be smooth. Even with the correct boundary conditions, Falcone’s
method does not really provide a suitable solution.

Fig. 3. a) original surface $u$; b) solution $\tilde{u}$ associated to corrupted boundary conditions
and to the image obtained from the original surface a) with the Eikonal equation; c)
maximal solution $u_{max}$ (in Falcone’s sense [8]) associated to the same image. $\tilde{u}$ and
$u_{max}$ present a kink at $x_0$ and $x_1$.

To summarize, the work of Rouy et al. [23], Prados et al. [22,20] and Falcone
et al. [8] suggests theories and numerical methods based on the concept of
viscosity solutions and requiring data on the boundary of the image. In contrast,
Dupuis and Oliensis [6] consider $C^1$ solutions. They characterize a $C^1$ solution
by specifying only its values at the critical points which are local minima. In
particular, they do not specify the values of the solution on the boundary of the
image. Also, they provide algorithms for approximating these smooth solutions.
Nevertheless, in practice, because of noise, incorrect modelling, and errors on
parameters or on the depth values enforced at the critical points, there do not
exist $C^1$ solutions to the SFS equations [16]. Therefore, the theory of Dupuis
and Oliensis does not apply.

Considering the drawbacks and the advantages of all these methods, it seems 
important to define a new class of weak solutions such that the characterization 
of Dupuis and Oliensis holds, and which provides a (theoretical and numerical) 
solution when there do not exist smooth solutions. 

As we show in [21], the classical notion of viscosity solutions, like the notion 
of singular viscosity solutions (pioneered by Ishii and Ramaswamy [11] and re- 
cently upgraded by Camilli and Siconolfi [3]) does not provide a direct extension 
of the Dupuis and Oliensis work. For such an extension, we must modify these 
notions and we must consider a “new” type of boundary conditions (called “state 
constraints” [24]). It turns out that the correct notion of viscosity solution for 
the SFS problem is the “singular discontinuous viscosity solution with Dirichlet 
boundary conditions and state constraints” . These solutions can be interpreted 
as maximal solutions and have the great advantage of not necessarily requiring 
boundary or critical points conditions. Moreover, this notion provides a math- 
ematical framework unifying the work of Rouy et al. [15,23], Prados et al. [22, 
20], Falcone et al. [8] and Dupuis and Oliensis [6]. 



4 Singular Discontinuous Viscosity Solutions for SFS 



In this section we briefly describe the notion of “singular discontinuous viscosity 
solutions with Dirichlet boundary conditions and state constraints” (SDVS). 
We refer to [21] for more details. Also we do not recall the classical definition of 
viscosity solutions: see [1] for a recent overview. 

Considering the generic SFS problem, we concentrate on the following HJB 
equation: 



$$\sup_{a \in A} \{-f(x, a) \cdot \nabla u(x) - l(x, a)\} = 0, \quad \forall x \in \Omega \tag{3}$$

To simplify, we assume in this paper¹ that $l \geq 0$ and we denote $S :=
\{x \mid l(x, a) = 0 \text{ for some } a \in A\}$. Also assume that $S$ verifies $S \cap \partial\Omega = \emptyset$.
To equation (3), we add Dirichlet boundary conditions (DBC) on the boundary
of the image and on $S$:

$$u(x) = \varphi(x), \quad \forall x \in \partial\Omega \cup S; \tag{4}$$

$\varphi$ being continuous from $\partial\Omega \cup S$ into $\mathbb{R} \cup \{+\infty\}$ (but $\varphi$ is not $+\infty$ everywhere). At
the points $x$ s.t. $\varphi(x) = +\infty$, we say that we impose a state constraint [24]. In
the SFS context, $S$ is the set of critical points $\{x \mid I(x) = 1\}$.

Definition: $u$ is a SDVS of (3)-(4) if $u$ is a discontinuous viscosity solution
of (3)-(4) on $\overline{\Omega} - S$ and if $\forall x \in S$, [$u^*(x) \leq \varphi(x)$] and [$u_*(x) \geq \varphi(x)$ or $u_*$ is a
singular viscosity supersolution in Camilli’s sense at the point $x$].

Definitions of $u^*$ and $u_*$ (not detailed here because of space) can be found in [1].
The notion of singular viscosity supersolution in Camilli’s sense is completely
described in [3,4].

In [21], we prove the existence and the uniqueness of the SDVS of all SFS equations
as soon as $I$ is Lipschitz continuous and the Hamiltonian is coercive (e.g. $H_{D/O}$
and $H_{R/T}$ are coercive when $I(x) > |\mathbf{l}|$). We also prove the robustness of this solution
to pixel noise and to errors on the light or focal length parameters. Finally, note
that, when we impose state constraints on the boundary of the images and some
critical points, this solution can be interpreted as the maximal viscosity solution.
See [21] for more details.

5 A General Framework for SFS 

The main interest of this “new” class of solutions lies in the possibility to impose
the heights of the solution at the critical points when we know them (this is
impossible with discontinuous viscosity solutions; it is possible with continuous viscosity
solutions but compatibility conditions are required) and in the possibility to “send
at infinity” the boundary conditions when we do not know them (this possibility
is not considered by Falcone et al. [8]). The relevance of this notion is reinforced
by its consistency with the work of Dupuis and Oliensis [6]. This is illustrated
by the following proposition (see [21]):

Proposition 1. Let $u$ be a $C^1$ solution of equation (3). Let $\tilde{S}$ be the subset of
$S$ corresponding to the local minima of $u$. If $u$ verifies the assumption 2.1 of
[6]², then $u$ is the SDVS of (3)-(4) for $\varphi(x) = u(x)$ $\forall x \in \tilde{S}$ and $\varphi(x) = +\infty$
elsewhere.

¹ In [21], we do not assume that $l \geq 0$. Also, the definition of $S$ and the developed
tools are more sophisticated. Note that, except for $H_{R/T}$ and $H_{P/F}$, all the SFS
Hamiltonians verify $l \geq 0$. This justifies our interest for the Hamiltonian $H_{D/O}$ and
the original one $H_{pers}$.

² Not stated here because of space.
Therefore, when there do not exist $C^1$ solutions, the SDVS consistently extend
the work of Dupuis and Oliensis. Moreover, the SDVS unify the various theories
used for solving the SFS problem. In effect, we can verify that [21]:

o In the case where the DBC are finite on $\partial\Omega \cup S$ and the compatibility
condition (see [14]) holds, the SDVS of (3)-(4) is the continuous viscosity
solution used by [23,15,22,20].

o When the DBC are finite on the boundary of the image and state
constraints are imposed at the critical points, the SDVS of (3)-(4) corresponds
to Camilli’s singular viscosity solutions [3,4] used by Falcone [8].

o As seen above, the SDVS corresponds to the $C^1$ solution of (3) verifying the
assumption 2.1 of Dupuis and Oliensis [6].

Consequently, the theoretical results of Falcone et al. [8], Rouy et al. [23,15],
Prados et al. [22,20] and Dupuis et al. [6] are automatically extended to the
“perspective SFS” (use $H_F$ and $H_{pers}$). Finally, one can conjecture that by using the
work of [12,25], the notion of SDVS can be extended to solve the SFS problem with
discontinuous images. This would be very difficult without the tool of viscosity
solutions.

6 Numerical Approximation of the SDVS for Generic 
SFS 

This section explains how to compute a numerical approximation of the SDVS 
of the generic SFS equation. This requires three steps. First we “regularize” the 
equation. Second, we approximate the “regularized” SFS equation by approxi- 
mation schemes. Finally, from the approximation schemes, we design numerical 
algorithms. 



Regularisation of the Generic SFS Equation: 

For an intensity image $I$ and $\varepsilon > 0$, let us consider the truncated image $I_\varepsilon$
defined by $I_\varepsilon(x) = \min(I(x), 1 - \varepsilon)$. By using a stability result, we prove that for
the generic SFS Hamiltonian, the SDVS associated with the image $I_\varepsilon$ converges
uniformly toward the SDVS associated with the image $I$ when $\varepsilon \to 0$. Also,
$\forall \varepsilon > 0$, the generic SFS equation associated with $I_\varepsilon$ is no longer degenerate. Thus,
for approximating this equation, we can use the classical tools developed by
Barles and Souganidis [2].
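
The truncation itself is a one-line operation. The sketch below (hypothetical helper names `truncate_image` and `eikonal_rhs`, and a made-up 2x2 image) also shows, for the Eikonal case only, why the truncated problem is no longer degenerate: the right-hand side $\sqrt{1/I_\varepsilon^2 - 1}$ stays strictly positive everywhere, whereas it vanishes at the critical points of the original image.

```python
import numpy as np

def truncate_image(image, eps):
    """Regularised image I_eps(x) = min(I(x), 1 - eps)."""
    return np.minimum(np.asarray(image, dtype=float), 1.0 - eps)

def eikonal_rhs(image):
    """Right-hand side sqrt(1 / I^2 - 1) of the Eikonal SFS equation."""
    image = np.asarray(image, dtype=float)
    return np.sqrt(1.0 / image ** 2 - 1.0)

image = np.array([[0.5, 0.8], [1.0, 0.99]])   # contains one critical point
eps = 0.05
image_eps = truncate_image(image, eps)

print(eikonal_rhs(image).min())      # 0.0 at the critical point
print(eikonal_rhs(image_eps).min())  # strictly positive
```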



Approximation Schemes for the Nondegenerate SFS Equations: 

Let us consider the “regularized” generic SFS equation. The theory of Barles and
Souganidis [2] suggests considering monotone schemes. Therefore, we construct
the following monotone scheme (we call it “implicit”) $S(\rho, x, u(x), u) = 0$ with

$$S(\rho, x, t, u) = \max_{s_1, s_2 = \pm 1} S_{s_1,s_2}(\rho, x, t, u),$$

where $\rho = (\Delta x_1, \Delta x_2)$ is the mesh size and where we choose:

$$S_{s_1,s_2}(\rho, x, t, u) = \sup_{a \in A_{s_1,s_2}} \left\{ -f_g(x, a) \cdot \begin{pmatrix} \dfrac{t - u(x + s_1 \Delta x_1 e_1)}{-s_1 \Delta x_1} \\[2mm] \dfrac{t - u(x + s_2 \Delta x_2 e_2)}{-s_2 \Delta x_2} \end{pmatrix} - l_g(x, a) \right\},$$

$$A_{s_1,s_2} = \{a \in A \mid f_{g,1}(x, a)\, s_1 \geq 0 \text{ and } f_{g,2}(x, a)\, s_2 \geq 0\}.$$

Fig. 4. a) original surface; b) image generated from a) by the Eikonal process [size
400 x 400]; c) reconstructed surface from b) after 15 iterations of Dupuis and Oliensis’
algorithm (based on differential games) enforcing the exact Dirichlet condition on the
boundary of the image and at all critical points: $\varepsilon_1 = 0.015$, $\varepsilon_2 = 5.7e-05$, $\varepsilon_\infty = 0.35$;
d) reconstructed surface by the implicit algorithm with the same boundary data and
after the same number (15) of iterations as c): $\varepsilon_1 = 0.002$, $\varepsilon_2 = 1.0e-05$, $\varepsilon_\infty = 0.014$.

By introducing a fictitious time $\Delta t$, we can transform the implicit scheme into
a “semi-implicit” scheme (also monotone):

$$S^{semi}(\rho, x, t, u) = t - \left( u(x) + \Delta t\, S(\rho, x, u(x), u) \right),$$

where $\Delta t = (f_g(x, a_0) \cdot (1/\Delta x_1, 1/\Delta x_2))^{-1}$, $a_0$ being the optimal control.

Let us emphasize that these two schemes have exactly the same solutions.
Using Barles and Souganidis’ definitions [2], we prove in [19] that these
schemes are always monotone and stable. Also, they are consistent with the
generic SFS equation as soon as the intensity image is Lipschitz continuous.
Finally, when the Hamiltonian is coercive, we prove that the solutions of these
schemes converge toward the unique SDVS of the “regularized” generic SFS
equation when $\rho \to 0$.



Remark: These two schemes have also a control interpretation. It is easy to verify 
that the implicit scheme is an extension of the control-based schemes proposed by 
[23,15,22] and the semi-implicit scheme corresponds to the control-based scheme 
proposed by [6] . All these schemes have been designed for the “orthographic SFS” 
problem. Note that for a given Hamiltonian, they all have the same solutions. 
Therefore we have unified and generalized these various approaches. 



Numerical Algorithms for the Generic SFS Problem: 

In the previous section, we have proposed two schemes whose solutions converge
toward the unique SDVS of the “regularized” generic SFS equation. For
each scheme, we now describe an algorithm that computes an approximation of
$u^\rho$.

For a fixed mesh size $\rho = (\Delta x_1, \Delta x_2)$, let us denote $x_{ij} := (i\Delta x_1, j\Delta x_2)$ and
$X := \{x_{ij} \in \overline{\Omega};\ i,j \in \mathbb{Z}\}$. The algorithms consist of the following computation
of the sequence of values $U^n_{ij}$, $n \geq 0$ ($U^n_{ij}$ being an approximation of $u^\rho(x_{ij})$).
Algorithm 1.

1. Initialisation ($n = 0$): $U^0_{ij} = u_0(x_{ij})$.

2. Choice of a pixel $x_{ij} \in X$ and modification of $U^n_{ij}$: We choose $U^{n+1}$ such
that $\forall (k,l) \neq (i,j)$, $U^{n+1}_{kl} = U^n_{kl}$, and $S(\rho, x_{ij}, U^{n+1}_{ij}, U^n) = 0$.

3. Choose the next pixel $x_{ij} \in X$ in such a way that all pixels of $X$ are regularly
visited and go back to 2.
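
The pixel-by-pixel structure of Algorithm 1 can be sketched as follows. This is only an illustration under stated assumptions: the names `sweep` and `toy_local_solve` are hypothetical, and the local solver is a toy upwind update for the equation |∇u| = 1 with one pixel held at 0, not the SFS scheme S of the paper. The large constant initial value plays the role of a supersolution.

```python
import numpy as np

def sweep(u0, local_solve, n_sweeps):
    """Visit every pixel in raster order and update it in place (steps 2 and 3)."""
    U = np.array(u0, dtype=float)          # step 1: U^0 = u0
    rows, cols = U.shape
    for _ in range(n_sweeps):
        for i in range(rows):
            for j in range(cols):
                # Only pixel (i, j) changes; all other values are kept.
                U[i, j] = local_solve(U, i, j)
    return U

def toy_local_solve(U, i, j, h=1.0, source=(0, 0)):
    """Toy stand-in for solving S(rho, x_ij, t, U) = 0 in t (hypothetical)."""
    if (i, j) == source:
        return 0.0                         # prescribed value at one pixel
    rows, cols = U.shape
    neighbours = [U[k, l]
                  for k, l in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                  if 0 <= k < rows and 0 <= l < cols]
    # Upwind update: smallest neighbour plus one grid step, never increasing.
    return min(U[i, j], min(neighbours) + h)

U = sweep(np.full((4, 4), 100.0), toy_local_solve, n_sweeps=2)
```

In this toy setting the values decrease monotonically from the initial constant, which mirrors the role a supersolution plays as a starting point.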




reconstructed surface from b) with the implicit algorithm (IA) after only 3 iterations, using
the exact boundary data at all critical points and with state constraints on the boundary
of the image: $\varepsilon_1 \approx 0.58$, $\varepsilon_2 \approx 0.0019$, $\varepsilon_\infty \approx 0.42$; d) reconstructed surface by the
IA (after 3 iterations) with state constraints on the boundary of the image and at all
the critical points except at that on the nose: $\varepsilon_1 \approx 0.60$, $\varepsilon_2 \approx 0.0020$, $\varepsilon_\infty \approx 0.42$.

We prove in [19] that if $u_0$ is a subsolution or a supersolution, then the computed
numerical approximations converge toward $u^\rho$. In their work, Rouy, Prados et al.
[23,15,22,20] use (some particular cases of) the implicit algorithm starting from
a subsolution. When we start from a supersolution, we reduce the number of
iterations by 3 orders of magnitude! In [20], Prados and Faugeras need around
4000 iterations for computing the surface of the classical Mozart face [29].
Starting from a supersolution (in practice, a large constant function $u_0$ does the
trick!), only three iterations are sufficient for obtaining a good result; see figure 5.
As an example, we show in figure 4 a comparison of our results with those of
what we consider to be the most efficient algorithm of the SFS literature [6].
Figures 4-c) and 4-d) show the results returned by our implementation of this
algorithm and our algorithm, respectively, after 15 iterations. The results are
visually different. This visual difference is confirmed by the computation of the
errors with respect to the original surface ($\varepsilon_1$, $\varepsilon_2$ and $\varepsilon_\infty$ are the errors of the
computed surface measured according to the $L_1$, $L_2$ and $L_\infty$ norms, respectively).
Nevertheless let us note that the cost of one update is slightly larger for our 
implicit algorithm than for the (semi-implicit) algorithm of Dupuis and Oliensis. 
This may also be because we have not optimized our code for this special case. 
Let us add that in this test, we have constrained the solution by the exact 
Dirichlet condition on the boundary of the image and at all the critical points. 
Let us recall that the SDVS method does not necessarily require boundary data. 
Figure 5 shows some reconstructions of the Mozart face when using the exact 
boundary data at all the critical points and state constraints on the boundary 
of the image (Fig.5-c), and with no boundary data, except for the tip of the 
nose (Fig.5-d). Moreover, let us emphasize that our implicit algorithm (as our 
semi-implicit one) allows us to compute numerical approximations of the SDVS
of the degenerate (when the intensity reaches 1) and generic SFS problem. Thus,
we only need to implement a single algorithm for all SFS models. Finally,
let us remark that, as the theory predicted, our algorithm shows an exceptional
robustness to noise and to errors in the parameters. This robustness is even greater
when we send the boundary to infinity (apply the state constraints). Figure 6
displays a reconstruction of Mozart’s face from an image perturbed by additive
uniformly distributed white noise (SNR $\approx$ 5) by using the implicit algorithm
with the wrong parameters $\mathbf{l}_\varepsilon = (0.2, -0.1)$ and $f_\varepsilon = 10.5$ (focal length) and
without any boundary data. The original image Fig.6-a) has been synthesized
with $\mathbf{l} = (0.1, -0.3)$ and $f = 3.5$. The angle between the initial light vector
$L$ and the corrupted light vector $L_\varepsilon$ is around 13°. More details, experimental
comparisons and stability tests can be found in [19,21]. These reports also contain
the proofs of all our statements.
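
The quoted angle between the two light directions can be checked with a few lines. The sketch assumes that each light vector is the unit vector whose first two components are the stated pair, namely (0.1, -0.3) for the original image and (0.2, -0.1) for the corrupted parameters, and whose third component is positive; this lifting and the helper name `light_vector` are assumptions made for illustration.

```python
import numpy as np

def light_vector(l):
    """Unit vector (l1, l2, sqrt(1 - |l|^2)); this lifting is an assumption."""
    l = np.asarray(l, dtype=float)
    return np.array([l[0], l[1], np.sqrt(1.0 - l @ l)])

L_true = light_vector((0.1, -0.3))     # parameters used to synthesize the image
L_wrong = light_vector((0.2, -0.1))    # corrupted parameters given to the solver

cos_angle = np.clip(L_true @ L_wrong, -1.0, 1.0)
angle_deg = np.degrees(np.arccos(cos_angle))
print(f"angle between light vectors: {angle_deg:.1f} degrees")
```

Under this assumption the angle comes out close to 13°, in line with the value stated in the text.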




Fig. 6. a) Image generated from Mozart’s face represented in Fig.5-a) with $\mathbf{l} = (0.1, -0.3)$
and $f = 3.5$ [size $\approx 200 \times 200$]; b) noisy image (SNR $\approx$ 5); c) reconstructed
surface from b) after 4 iterations of the implicit algorithm, using the incorrect
parameters $\mathbf{l}_\varepsilon = (0.2, -0.1)$ and $f_\varepsilon = 10.5$, and with state constraints on the boundary
of the image and at all the critical points except at the critical point on the nose.



7 Conclusion 

We have unified various formulations of the Lambertian SFS problem; in par- 
ticular the orthographic and perspective problems. We have developed a new 
mathematical framework which unifies some SFS theories and generalizes them 
to all SFS Hamiltonians. Let us emphasize that we do not consider Mathematics 
as a goal in itself. Mathematics is simply a powerful tool allowing us to 

• suggest some numerical methods and algorithms;

• certify algorithms, guarantee their robustness and describe their limitations;

• better understand what we compute. In particular, when the problem has
several solutions, it allows us to characterize all the solutions, a necessary
preliminary step for deciding which solution we want to compute.

In effect, our theory ensures the stability and the convergence of our SFS method.
It also suggests a robust SFS algorithm which seems to be the most efficient
iterative algorithm of the SFS literature. Moreover, our new class of weak solutions
is much better suited to the SFS specifications; in particular, it does not
necessarily require boundary data. We are extending our approach to non-Lambertian
SFS and to SFS with discontinuous images.
Abstract. The output of modern imaging techniques such as diffusion
tensor MRI or the physical measurement of anisotropic behaviour in
materials such as the stress tensor consists of tensor-valued data. Hence
adequate image processing methods for shape analysis, skeletonisation,
denoising and segmentation are in demand. The goal of this paper is
to extend the morphological operations of dilation, erosion, opening and
closing to the matrix-valued setting. We show that naive approaches
such as componentwise application of scalar morphological operations
are unsatisfactory, since they violate elementary requirements such as
invariance under rotation. This leads us to study an analytic and a
geometric alternative which are rotation invariant. Both methods introduce
novel non-component-wise definitions of a supremum and an infimum of
a finite set of matrices. The resulting morphological operations
incorporate information from all matrix channels simultaneously and preserve
positive definiteness of the matrix field. Their properties and their
performance are illustrated by experiments on diffusion tensor MRI data.

Keywords: mathematical morphology, dilation, erosion, matrix-valued 
imaging, DT-MRI 



1 Introduction 

Modern data and image processing encompasses more and more the analysis and
processing of matrix-valued data. For instance, diffusion tensor magnetic
resonance imaging (DT-MRI), a novel medical image acquisition technique, measures
the diffusion properties of water molecules in tissue. It assigns a positive definite
matrix to each voxel, and the resulting matrix field is a valuable source of
information for the diagnosis of multiple sclerosis and strokes [13]. Matrix fields also
make their natural appearance in civil engineering and solid mechanics. In these
areas inertia, diffusion and permittivity tensors and stress-strain relationships
are an important tool in describing anisotropic behaviour. In the form of the
so-called structure tensor (also called Förstner interest operator, second moment
matrix or scatter matrix) [7] the tensor concept turned out to be of great value
in image analysis, segmentation and grouping [9].

T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3024, pp. 155—167, 2004. 
So there is definitely a need to develop tools for the analysis of such data, since
anybody who attempts to do so is confronted with the same basic tasks as in
the scalar-valued case: how to remove noise or how to detect edges and shapes,
for example.

Image processing of tensor fields is a very recent research area, and a number
of methods consist of applying scalar- and vector-valued filters to the
components, eigenvalues or eigenvectors of the matrix field. Genuine matrix-valued
concepts with channel interaction are available for nonlinear regularisation methods
and related diffusion filters [17,18], for level set methods [6], median filtering [19]
and homomorphic filters [4]. To our knowledge, however, extensions of classical
morphology to the matrix setting have not been considered so far.

Our paper aims at closing this gap by offering extensions of the fundamental
morphological operations dilation and erosion to matrix-valued images.
Mathematical morphology has proven useful for the processing and analysis
of binary and greyscale images: Morphological operators and filters perform noise
suppression, edge detection, shape analysis, and skeletonisation in medical and
geological imaging, for instance [15]. Even the extension of concepts of
scalar-valued morphology to vector-valued data such as colour images is by no means
straightforward. The application of standard scalar-valued techniques to each
channel of the image independently, that is, component-wise performance
of morphological operations, might lead to information corruption in the image,
because, in general, these components are strongly correlated [1,8]. Numerous
attempts have been made to develop satisfying concepts of operators for colour
morphology. The difficulty lies in the fact that the morphological operators rely
on the notion of infimum and supremum which in turn requires an appropriate
ordering of the colours, i.e. vectors in the selected vector space. However, there
is no generally accepted definition of such an ordering [2,16,12]. Different types
of orderings such as marginal or reduced ordering [2] are reported to result in an
unacceptable alteration of colour balance and object boundaries in the image [5],
or in the existence of more than one supremum (infimum) creating ambiguities
in the output image [12]. These are clear disadvantages for many applications.
In connection with noise suppression, morphological filters based on vector
ranking concepts [2] have been developed [11,5]. In [3] known connections between
median filters, inf-sup operations and geometrical partial differential equations
[10] have been extended from the scalar to the vectorial case.

In any case, the lack of a generally suitable ordering on vector spaces is a
very severe hindrance in the development of morphological operators for
vector-valued images. Surprisingly, the situation in the matrix-valued setting is more
encouraging since we have additional analytic-algebraic or geometric properties
of the image values at our disposal: (a) Unlike in the vectorial setting, one can
multiply matrices, define polynomials and even take roots of matrices. (b)
Real symmetric, positive definite matrices can be graphically represented by
ellipses ($2 \times 2$ matrices) or ellipsoids ($3 \times 3$ matrices) in a unique way. However,
there is also the burden of additional conditions that have to be fulfilled by the
morphological operations to be defined: They have to be rotationally invariant
and they must preserve the positive definiteness of the matrix field as well,
since applications such as DT-MRI create such data sets. In this paper we will
exploit the analytic-algebraic property (a) and the geometric property (b) by
introducing novel notions for the supremum/infimum of a finite set of matrices.
These notions are rotationally invariant and preserve positive definiteness.

Interestingly, already the requirement of rotational invariance rules out the 
straightforward component-wise approach: Consider for example 



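\[
A_1 := \begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}, \quad
A_2 := \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}, \quad
S := \begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}.
\]

Here, $S$ is the componentwise supremum of $A_1$, $A_2$. Rotating $A_1$ and $A_2$ by 90
degrees and taking again the componentwise supremum yields

\[
A_1' = \begin{pmatrix} 3 & -2 \\ -2 & 3 \end{pmatrix}, \quad
A_2' = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \quad
S' = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix},
\]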



where S' is clearly not obtained by rotating S. This counterexample shows that 
it is not obvious how to design reasonable extensions of morphological operations 
to the matrix-valued setting. 
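
The counterexample is easy to reproduce numerically. The sketch below takes the two matrices of the example as A1 = [[3, 2], [2, 3]] and A2 = [[2, -1], [-1, 2]], uses NumPy's elementwise maximum as the componentwise supremum, and compares the supremum of the rotated matrices with the rotated supremum; the helper name `rotate` is chosen here for illustration.

```python
import numpy as np

A1 = np.array([[3.0, 2.0], [2.0, 3.0]])
A2 = np.array([[2.0, -1.0], [-1.0, 2.0]])
R = np.array([[0.0, -1.0], [1.0, 0.0]])    # rotation by 90 degrees

def rotate(A):
    """Rotate the symmetric matrix A: A -> R A R^T."""
    return R @ A @ R.T

S = np.maximum(A1, A2)                           # componentwise supremum
S_prime = np.maximum(rotate(A1), rotate(A2))     # supremum after rotating the inputs

print(S_prime)       # supremum of the rotated matrices
print(rotate(S))     # rotated supremum of the original matrices
```

The two printed matrices differ in their off-diagonal entries, which is exactly the failure of rotational invariance described above.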

The structure of our paper is as follows: In the next section we give a very 
brief review of the basic greyscale morphological operations. Then we establish 
novel definitions of the crucial sup- and inf-operations in the vector valued case 
via the analytic-algebraic approach and investigate some of their properties in 
Section 3. Alternatively, in Section 4 we develop new definitions for the sup- and 
inf-operations starting from a geometric point of view. Section 5 is devoted to 
experiments where the two methodologies are applied to real DT-MRI images. 
Concluding remarks are presented in Section 6. 



2 Mathematical Morphology in the Scalar Case 

In greyscale morphology an image is represented by a scalar function $f(x, y)$
with $(x,y) \in \mathbb{R}^2$. The so-called structuring element is a set $B$ in $\mathbb{R}^2$ that
determines the neighbourhood relation of pixels with respect to a shape analysis task.
Greyscale dilation $\oplus$ replaces the greyvalue of the image $f(x, y)$ by its supremum
within a mask defined by $B$:

\[ (f \oplus B)(x,y) := \sup\, \{ f(x-x', y-y') \mid (x',y') \in B \}, \]

while erosion $\ominus$ is determined by

\[ (f \ominus B)(x,y) := \inf\, \{ f(x+x', y+y') \mid (x',y') \in B \}. \]

The opening operation, denoted by $\circ$, as well as the closing operation, indicated
by the symbol $\bullet$, are defined via concatenation of erosion and dilation:

\[ f \circ B := (f \ominus B) \oplus B \quad \text{and} \quad f \bullet B := (f \oplus B) \ominus B. \]

These operations form the basis of many other processes in mathematical mor- 
phology [14,15]. 
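
As a minimal sketch of these four scalar operations on a discrete image, the following code represents the structuring element B as a list of integer offsets. Two assumptions are made that the text does not prescribe: B contains the origin (0, 0), and offsets that fall outside the image are simply ignored. The function names are chosen here for illustration.

```python
import numpy as np

def dilate(f, B):
    """Dilation: maximum of f(x - x', y - y') over the offsets (x', y') in B."""
    f = np.asarray(f, dtype=float)
    rows, cols = f.shape
    out = np.empty_like(f)
    for x in range(rows):
        for y in range(cols):
            out[x, y] = max(f[x - dx, y - dy] for dx, dy in B
                            if 0 <= x - dx < rows and 0 <= y - dy < cols)
    return out

def erode(f, B):
    """Erosion: minimum of f(x + x', y + y') over the offsets (x', y') in B."""
    f = np.asarray(f, dtype=float)
    rows, cols = f.shape
    out = np.empty_like(f)
    for x in range(rows):
        for y in range(cols):
            out[x, y] = min(f[x + dx, y + dy] for dx, dy in B
                            if 0 <= x + dx < rows and 0 <= y + dy < cols)
    return out

def opening(f, B):
    """Erosion followed by dilation."""
    return dilate(erode(f, B), B)

def closing(f, B):
    """Dilation followed by erosion."""
    return erode(dilate(f, B), B)

cross = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]   # small cross-shaped B
image = np.array([[1, 2, 3], [4, 9, 0], [5, 6, 7]], dtype=float)
print(dilate(image, cross))
print(erode(image, cross))
```

With the origin contained in B, dilation never decreases a grey value and erosion never increases it, while opening stays below and closing stays above the original image.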
3 Model 1: An Analytic Definition of Dilation and
Erosion for Matrix-Valued Images

The decisive step in defining morphological dilation and erosion operations for
matrix-valued data is to find a suitable notion of supremum and infimum of
a finite set of positive definite matrices. For positive real numbers $a_1, \ldots, a_k$,
$k \in \mathbb{N}$, there is a well-known connection between their modified $p$-mean and
their supremum:

\[ \lim_{p \to +\infty} \Big( \sum_{i=1}^{k} a_i^p \Big)^{1/p} = \sup\{a_1, \ldots, a_k\}. \tag{1} \]

A completely analogous relation holds also for the infimum with the difference
that $p$ now tends to $-\infty$:

\[ \lim_{p \to -\infty} \Big( \sum_{i=1}^{k} a_i^p \Big)^{1/p} = \inf\{a_1, \ldots, a_k\}. \tag{2} \]

That means, the $p$-means can serve as a substitute for the supremum (infimum) if
$p$ is large. The idea is now to replace the positive numbers in the above relation by
their matrix generalisations, the positive definite matrices $A_1, \ldots, A_k$. However,
to this end we have to define the $p$-th root of a positive definite $(n \times n)$-matrix
$A$. We know from linear algebra that there exists an orthogonal $(n \times n)$-matrix
$V$ (which means $V^\top V = V V^\top = I$, with unit matrix $I$) such that

\[ A = V \cdot \mathrm{diag}(\alpha_1, \ldots, \alpha_n) \cdot V^\top, \tag{3} \]

where the expression in the center on the right denotes the diagonal matrix with
the positive eigenvalues $\alpha_1, \ldots, \alpha_n$ of $A$ as entries on the diagonal. Now taking
the $p$-th root of a matrix is achieved by taking the $p$-th root of the eigenvalues
in decomposition (3):

\[ A^{1/p} := V \, \mathrm{diag}(\alpha_1^{1/p}, \ldots, \alpha_n^{1/p}) \, V^\top. \]

Note that the $p$-th power $A^p$ can be calculated in this manner as well. Hence we
can give meaning to the expression $\big( \sum_{i=1}^{k} A_i^p \big)^{1/p}$ and can define new matrices
$\sup\{A_1, \ldots, A_k\}$ and $\inf\{A_1, \ldots, A_k\}$ via the limits of their modified $p$-mean
for $p \to \pm\infty$:

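Definition 1. The supremum and infimum of a set of positive definite matrices
$A_1, \ldots, A_k$ are defined as

\[ \sup\{A_1, \ldots, A_k\} := \lim_{p \to +\infty} \Big( \sum_{i=1}^{k} A_i^p \Big)^{1/p}, \tag{4} \]

\[ \inf\{A_1, \ldots, A_k\} := \lim_{p \to -\infty} \Big( \sum_{i=1}^{k} A_i^p \Big)^{1/p}. \tag{5} \]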
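
Definition 1 can be approximated numerically by evaluating the modified p-mean for a large positive or negative p. The sketch below is one possible realisation, not the authors' implementation: matrix powers are computed from the eigendecomposition returned by `numpy.linalg.eigh`, and the helper names `sym_power` and `p_mean` are chosen here for illustration. The example uses two commuting (diagonal) matrices, for which the limits can be read off entry by entry.

```python
import numpy as np

def sym_power(A, p):
    """A^p for a symmetric positive definite A via its eigendecomposition."""
    w, V = np.linalg.eigh(A)               # A = V diag(w) V^T, w > 0
    return (V * w**p) @ V.T                # scales column j of V by w_j^p

def p_mean(mats, p):
    """Modified p-mean (sum_i A_i^p)^(1/p); large |p| approximates sup / inf."""
    total = sum(sym_power(np.asarray(A, dtype=float), p) for A in mats)
    return sym_power(total, 1.0 / p)

A = np.diag([1.0, 4.0])
B = np.diag([3.0, 2.0])
approx_sup = p_mean([A, B], 40)     # close to diag(3, 4)
approx_inf = p_mean([A, B], -40)    # close to diag(1, 2)
```

For matrices that do not commute, the convergence in p can be noticeably slower than in this diagonal example, so the value of p is a trade-off between accuracy and numerical range.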
With this definition, taking the supremum is a rotationally invariant operation,
i.e. $\sup\{U A_1 U^\top, \ldots, U A_k U^\top\} = U \cdot \sup\{A_1, \ldots, A_k\} \cdot U^\top$ for any orthogonal
$(n \times n)$-matrix $U$. This may be seen as follows. Since $\sum_{i=1}^{k} A_i^p$ is positive definite,
there exist an orthogonal matrix $V$ and a diagonal matrix $D$ with $\sum_{i=1}^{k} A_i^p = V D V^\top$.
As a consequence we obtain

\[
\Big( \sum_{i=1}^{k} (U A_i U^\top)^p \Big)^{1/p}
= \Big( U \Big( \sum_{i=1}^{k} A_i^p \Big) U^\top \Big)^{1/p}
= \big( U V D V^\top U^\top \big)^{1/p}
= U V D^{1/p} V^\top U^\top
= U \Big( \sum_{i=1}^{k} A_i^p \Big)^{1/p} U^\top,
\]

where we have used the facts that $U^\top U = I$ and $UV$ is also orthogonal. Therefore
the $p$-th mean is rotationally invariant for all values of $p$, and hence also in the
limits $p \to \pm\infty$.
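
The invariance of the p-th mean under rotations can also be observed numerically for a finite p. The following check is a sketch under assumptions: the two test matrices, the rotation angle of 30 degrees and the value p = 10 are arbitrary choices, and the helpers `sym_power` and `p_mean` are illustrative names for the eigendecomposition-based matrix power and the modified p-mean.

```python
import numpy as np

def sym_power(A, p):
    """A^p for a symmetric positive definite A via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * w**p) @ V.T

def p_mean(mats, p):
    """Modified p-mean (sum_i A_i^p)^(1/p)."""
    return sym_power(sum(sym_power(A, p) for A in mats), 1.0 / p)

theta = np.pi / 6                                   # arbitrary rotation angle
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A1 = np.array([[3.0, 2.0], [2.0, 3.0]])
A2 = np.array([[2.0, -1.0], [-1.0, 2.0]])

lhs = p_mean([U @ A1 @ U.T, U @ A2 @ U.T], 10)      # p-mean of rotated matrices
rhs = U @ p_mean([A1, A2], 10) @ U.T                # rotated p-mean
print(np.max(np.abs(lhs - rhs)))                    # difference at rounding level
```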

Furthermore the $p$-th mean (and in the limit also supremum and infimum)
inherits the positive definiteness of its arguments: Positive definiteness is a
property stable under addition, and is also characterised by the positivity of the
eigenvalues. By construction the $p$-th power $A^p$ and the $p$-th root $A^{1/p}$ have
positive eigenvalues whenever $A$ has. Hence taking the $p$-th mean for any $p \in \mathbb{N}$
preserves positive definiteness.

For practical computations we will set $p$ to a sufficiently large number, say
10 or 20, such that the resulting matrices can be considered as reasonable
approximations to the supremum resp. infimum of $A_1, \ldots, A_k$.

Alternatively, the limiting matrix $M := \sup\{A_1, \ldots, A_k\}$ can also be
obtained directly from the eigenvalues and eigenvectors of the given set of matrices
$A_1, \ldots, A_k$. The largest eigenvalue and corresponding eigenvector are directly
adopted for $M$. In the $2 \times 2$ case, the eigenvector system of $M$ is already
determined by this condition. The remaining eigenvalue of $M$ is exactly the largest
eigenvalue from the given set of matrices that corresponds to an eigenvector
different from that of the largest eigenvalue; in general, this is the second largest
eigenvalue from the given set. A similar statement holds in higher dimensions.
Moreover, replacing largest by smallest eigenvalues, a characterisation of infima
is obtained. We sketch the proof for suprema of $2 \times 2$ matrices. Note first that the
sum $\sum_i A_i^p$ does not change if every matrix $A_i$ is replaced by the two rank-one
matrices $\lambda_1 v_1 v_1^\top$ and $\lambda_2 v_2 v_2^\top$ corresponding to the eigenvalue-eigenvector pairs
$(\lambda_1, v_1)$ and $(\lambda_2, v_2)$ of $A_i$. Let now $\Lambda$ be the largest eigenvalue from the given set
of matrices, and $\lambda$ the second-largest one in the sense described above. Without
loss of generality, assume that the eigenvector of $\Lambda$ is $(1,0)^\top$; the normalised
eigenvector for $\lambda$ is some $(s, c)^\top$, $c^2 + s^2 = 1$. Since the contributions of all
smaller eigenvalues and corresponding eigenvectors vanish in the limit $p \to +\infty$,
all we have to prove is that the $p$-mean

\[
M_p := \left( \Lambda^p \begin{pmatrix} 1 \\ 0 \end{pmatrix} (1, 0)
+ \lambda^p \begin{pmatrix} s \\ c \end{pmatrix} (s, c) \right)^{1/p}
= \begin{pmatrix} \Lambda^p + \lambda^p s^2 & \lambda^p c s \\ \lambda^p c s & \lambda^p c^2 \end{pmatrix}^{1/p}
\]

tends to $\mathrm{diag}(\Lambda, \lambda)$ for $p \to +\infty$. We introduce the abbreviations
$D_p := \Lambda^{2p} - 2\Lambda^p \lambda^p (c^2 - s^2) + \lambda^{2p}$ and $E_p := \Lambda^p - \lambda^p (c^2 - s^2)$. Then we can express the
eigenvalues of $M_p$ by $\big( \tfrac{1}{2} ( \Lambda^p + \lambda^p \pm \sqrt{D_p} ) \big)^{1/p}$, which tend to $\Lambda$ and $\lambda$ for $p \to +\infty$. An
eigenvector for the larger eigenvalue is given by $\big( \sqrt{\sqrt{D_p} + E_p},\ \sqrt{\sqrt{D_p} - E_p} \big)^\top$,
which encloses with $(1,0)^\top$ the angle $\varphi_p$ that satisfies $\tan^2 \varphi_p = \frac{\sqrt{D_p} - E_p}{\sqrt{D_p} + E_p}$. Since
the latter expression tends to 0 for $p \to +\infty$ if $\lambda < \Lambda$, we have that the limiting
matrix is diagonal as claimed. In case $\lambda = \Lambda$ we have already that $\lim_{p \to +\infty} M_p$ is
diagonal because of the eigenvalues. This completes the proof.
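
The eigenvalue-based characterisation of the Model 1 supremum in the 2 x 2 case can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function name `sup_from_eigenpairs` and the tolerance used to decide whether two eigenvectors are parallel are assumptions, and the sketch presumes that no input matrix is a multiple of the identity (otherwise its eigenvectors are not unique).

```python
import numpy as np

def sup_from_eigenpairs(mats, tol=1e-8):
    """Model 1 supremum of 2x2 positive definite matrices from their eigenpairs."""
    pairs = []
    for A in mats:
        w, V = np.linalg.eigh(np.asarray(A, dtype=float))
        pairs += [(w[0], V[:, 0]), (w[1], V[:, 1])]
    # Largest eigenvalue of the whole set and its (unit) eigenvector.
    big, v = max(pairs, key=lambda pair: pair[0])
    # Largest eigenvalue whose eigenvector is not parallel to v.
    small = max(lam for lam, vec in pairs if abs(abs(vec @ v) - 1.0) > tol)
    w_perp = np.array([-v[1], v[0]])       # unit vector orthogonal to v
    return big * np.outer(v, v) + small * np.outer(w_perp, w_perp)

A1 = np.array([[3.0, 2.0], [2.0, 3.0]])    # eigenvalues 5 and 1
A2 = np.array([[2.0, -1.0], [-1.0, 2.0]])  # eigenvalues 3 and 1
M = sup_from_eigenpairs([A1, A2])
print(M)
```

For this pair the largest eigenvalue 5 belongs to the direction (1, 1) and the remaining eigenvalue is 3, so the result is the matrix with diagonal entries 4 and off-diagonal entries 1.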

With the supremum and infimum operations at our disposal we can apply 
the definitions of the basic morphological operations dilation, erosion, opening 
and closing to matrix-valued images essentially verbatim.



4 Model 2: A Geometric Definition of Dilation and
Erosion for Matrix-Valued Images

We present now an alternative framework of dilation and erosion for positive
definite symmetric matrices. To this end we remark that a positive definite
symmetric $n \times n$ matrix $A$ corresponds to a quadratic form $Q(x) = x^\top A^{-2} x$, $x \in \mathbb{R}^n$.
The ellipsoid $x^\top A^{-2} x = 1$ centered around 0 is an isohypersurface of $Q$. This
ellipsoid has a natural interpretation in the context of diffusion tensors: Assuming
that a particle is initially located in the origin and is subject to the diffusivity $A$,
then the ellipsoid encloses the smallest volume within which this particle will be
found with some required probability after a short time interval. The directions
and lengths of the principal axes of the ellipsoid are given by the eigenvectors and
corresponding eigenvalues of $A$, respectively. By including degenerate ellipsoids
this description is easily extended to all positive definite symmetric matrices.
Then each positive definite matrix $A$ is represented by the image $A\mathcal{B}$ of the unit
ball $\mathcal{B} \subset \mathbb{R}^n$ under multiplication with $A$.

Geometric inclusion constitutes a natural semi-order for ellipsoids which leads
directly to a semi-order for positive definite matrices.

Definition 2. Let $A$, $B$ be positive definite matrices. We define that $A \subseteq B$ if
and only if $A\mathcal{B} \subseteq B\mathcal{B}$ where $\mathcal{B}$ is the unit ball in $\mathbb{R}^n$.

In the language of diffusion tensors $A \subseteq B$ means that for particles evolving
under diffusivities $A$ and $B$, the ellipsoid in which the first one is most probably
found is completely contained in the corresponding ellipsoid for the second.
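
The inclusion of Definition 2 can be tested numerically. The sketch below uses the observation that the image of the unit ball under A lies inside its image under B exactly when the inverse of B times A maps the unit ball into itself, that is, when the largest singular value of that product is at most 1. The function name `is_contained` and the tolerance are assumptions made for illustration.

```python
import numpy as np

def is_contained(A, B, tol=1e-10):
    """True if the ellipsoid A*ball lies inside the ellipsoid B*ball."""
    # Spectral norm (largest singular value) of B^{-1} A.
    return bool(np.linalg.norm(np.linalg.solve(B, A), 2) <= 1.0 + tol)

small_ball = np.eye(2)
large_ball = 2.0 * np.eye(2)
elongated = np.diag([3.0, 1.0])

print(is_contained(small_ball, large_ball))   # the smaller disk is inside the larger
print(is_contained(elongated, large_ball))    # sticks out along the first axis
print(is_contained(large_ball, elongated))    # sticks out along the second axis
```

The last two calls illustrate that the relation is only a semi-order: two ellipses can be incomparable.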

In the light of this semi-order, it makes sense to define the supremum of a
set of positive definite matrices as a minimal element (in some sense) among
all matrices that are greater or equal to all given matrices. Since, however, the
$\subseteq$ semi-order itself is not sufficient to determine such a minimal element, we
need an additional criterion. Therefore we introduce a second relation $\preceq$ which
is compatible with the first one in the sense that $A \subseteq B$ always implies $A \preceq B$.

Definition 3. Let $A$, $B$ be as above. We define that $A \preceq B$ if the ordered
sequence $\lambda_1(A) \geq \ldots \geq \lambda_n(A) > 0$ of the eigenvalues of $A$ is lexicographically
smaller or equal to the corresponding sequence $\lambda_1(B) \geq \ldots \geq \lambda_n(B) > 0$ of $B$,
i.e. if there exists an index $j$, $1 \leq j \leq n+1$, such that $\lambda_i(A) = \lambda_i(B)$ for all
$i < j$, and $\lambda_j(A) < \lambda_j(B)$ if $j \leq n$.

Note that $\preceq$ is not a semi-order in the strict sense because it does not allow us to
distinguish between a matrix and rotated versions of it. We can now define the
supremum of a set of positive definite matrices.

Definition 4. Let $A_1, \ldots, A_k$ be positive definite symmetric matrices. We define

\[ \sup\{A_1, \ldots, A_k\} := S \]

where $S$ is chosen such that $A_i \subseteq S$ for $i = 1, \ldots, k$, and $S \preceq Y$ for each $Y$
satisfying $A_i \subseteq Y$ for $i = 1, \ldots, k$.

By reversing all occurrences of $\subseteq$ and $\preceq$ we obtain an analogous definition that
introduces the infimum as a $\preceq$-maximal element in the set of all matrices which
are inferior to all given matrices w.r.t. $\subseteq$. The positive definiteness of the so
defined supremum and infimum is obvious from the definition, as is the rotational
invariance. A closer look shows that if all $A_i$ are positive definite, one has also
that the supremum of the inverses $A_i^{-1}$ is the inverse of the infimum of the $A_i$
and vice versa. This is in analogy to the definitions based on the $p$-mean.

Since it is not obvious how to compute the supremum of a given set $\{A_1, \ldots, A_k\}$
of tensors, we shall now briefly derive the necessary formulae in the case of $2 \times 2$
matrices. Assume that $\Lambda$ is the largest eigenvalue of all given matrices, and that
$(1,0)^\top$ is the corresponding eigenvector. Then this eigenvalue-eigenvector pair is
also one for the desired supremum matrix $S$. We have therefore $S = \mathrm{diag}(\Lambda, \lambda)$
where $\lambda \leq \Lambda$ is still to be determined. The decisive constraint for $\lambda$ is that for all
given matrices $A_i$, the images of the unit disk under $S^{-1} A_i$ must be contained
in the unit disk. For a single matrix $A_i = \begin{pmatrix} a & c \\ c & b \end{pmatrix}$ this condition comes down to

\[
\sqrt{(a\Lambda^{-1} + b\lambda^{-1})^2 + (c\Lambda^{-1} - c\lambda^{-1})^2}
+ \sqrt{(a\Lambda^{-1} - b\lambda^{-1})^2 + (c\Lambda^{-1} + c\lambda^{-1})^2} \leq 2
\]

(note that it is insufficient to consider only the largest eigenvalue of $S^{-1} A_i$ since
this matrix is in general asymmetric!). From this inequality we obtain by squaring
twice, re-arranging terms and finally taking the root again that

\[
\lambda \geq \sqrt{ \frac{(b^2 + c^2)\Lambda^2 - (ab - c^2)^2}{\Lambda^2 - a^2 - c^2} }. \tag{6}
\]

Iterating over all $A_i$ one finds the smallest $\lambda$ which satisfies all the conditions
simultaneously. Dismissing the condition that the eigenvector corresponding to $\Lambda$
is $(1,0)^\top$, the eigenvector system of $S$ is still determined by this eigenvector. One
only has to rotate all matrices $A_i$ using this eigenvector system before computing
the bounds for $\lambda$. This completes the algorithm in the $2 \times 2$ case.
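
The 2 x 2 procedure just described can be sketched in code. This is a hedged illustration rather than the authors' implementation: it takes the bound on the second eigenvalue to be the square root of ((b² + c²)Λ² − (ab − c²)²) / (Λ² − a² − c²), where Λ is the largest eigenvalue and a, b, c are the entries of a matrix expressed in the eigenvector frame of Λ. The function name `geometric_sup_2x2`, the tolerance, and the handling of the degenerate case (a matrix that already has the dominant eigenpair, for which the bound reduces to its entry b) are assumptions.

```python
import numpy as np

def geometric_sup_2x2(mats, tol=1e-12):
    """Model 2 supremum of 2x2 symmetric positive definite matrices (sketch)."""
    mats = [np.asarray(A, dtype=float) for A in mats]
    eig = [np.linalg.eigh(A) for A in mats]          # ascending eigenvalues
    idx = max(range(len(mats)), key=lambda i: eig[i][0][-1])
    big = eig[idx][0][-1]                            # largest eigenvalue overall
    v = eig[idx][1][:, -1]                           # its unit eigenvector
    Q = np.column_stack([v, [-v[1], v[0]]])          # frame in which v = (1, 0)
    small = 0.0
    for A in mats:
        M = Q.T @ A @ Q                              # rotate A into that frame
        a, c, b = M[0, 0], M[0, 1], M[1, 1]
        denom = big**2 - a**2 - c**2
        if denom <= tol * big**2:
            bound = b                                # A shares the dominant eigenpair
        else:
            bound = np.sqrt(((b**2 + c**2) * big**2 - (a * b - c**2)**2) / denom)
        small = max(small, bound)                    # all bounds must hold
    return Q @ np.diag([big, small]) @ Q.T

theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A = np.diag([4.0, 1.0])
B = R @ np.diag([3.0, 0.5]) @ R.T                    # rotated, thinner ellipse
S = geometric_sup_2x2([A, B])
print(S)
```

For this pair the result is the diagonal matrix with entries 4 and 2: the ellipse of S touches the ellipse of B without cutting it, which corresponds to the smallest eigenvalues of S² − B² being zero up to rounding.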
Extension of the algorithm to $3 \times 3$ and larger matrices works by considering
suitable sets of 2-dimensional sections to which the above formulae can be
applied. That it is sufficient to consider sections is a consequence of the following
observation: Given an ellipsoid centered at the origin and a point outside of it,
then the smallest ellipsoid centered at 0 that encloses both is tangent to the first
ellipsoid along an ellipse (or, in higher dimensions, an ellipsoid of next smaller
dimension). Repeating the above reasoning for the case of erosions, it becomes
clear that the smallest eigenvalue $\Lambda$ of $S$ and corresponding eigenvector are
directly obtained as the smallest eigenvalue and corresponding eigenvector of one
of the $A_i$. By analogous considerations as above one derives upper bounds for the
remaining eigenvalue $\lambda$ (which is now the larger one). Surprisingly, the bounds
are the same as in (6), only the relation sign is reversed to $\leq$.

Revisiting the $p$-mean approach from the viewpoint of the current section, one
sees that the $p$-mean supremum $M$ of a set $\{A_1, \ldots, A_k\}$ satisfies $A_i \subseteq M$ for all
$i = 1, \ldots, k$, and has the same largest eigenvalue and corresponding eigenvector
as the supremum $S$ defined here. However, in generic cases $S \subseteq M$ and $S \preceq M$
hold, and the eigenvalues of $M$ except the largest one exceed the corresponding
ones of $S$. Thus, $M$ is in general not a minimal element in the set of all $Y$ with
$A_i \subseteq Y$ for all $i$. Analogous considerations apply to the $p$-mean infimum.

5 Experimental Results 

In order to illustrate the differences between model 1 and 2, we have computed 
their behaviour on two ellipses. This is depicted in Figure 1. We observe that 
model 1 tends to reduce the eccentricity of the ellipses, whereas the more com- 
plicated model 2 is constructed in such a way that it corresponds exactly with 
our geometric intuition. 

As a real-world test image we use a DT-MRI data set of a human brain. We 
have extracted a 2-D section from the 3-D data. The 2-D image consists of four 
quadrants which show the four tensor channels of a 2 x 2 matrix. Each channel has 
a resolution of 128 x 128 pixels. The top right channel and bottom left channel are 
identical since the matrix is symmetric. Model 1 is always shown on the left side. 




Fig. 1. Left: Ellipses representing two positive definite matrices (thick lines), their 
supremum and infimum (thin lines) according to model 1. Right: Same with model 2. 






Morphological Operations on Matrix- Valued Images 163 




Fig. 2. Tensor-valued dilation and erosion. Left column, from top to bottom: Original
tensor image of size $128 \times 128$ per channel, dilation model 1 with disk-shaped stencil of
radius $\sqrt{5}$, erosion model 1 with disk-shaped stencil of radius $\sqrt{5}$. Right column, from
top to bottom: Same with model 2.
Fig. 3. Tensor-valued opening and closing. Left column, from top to bottom: Closing
model 1 with disk-shaped stencil of radius $\sqrt{5}$, opening model 1 with disk-shaped stencil
of radius $\sqrt{5}$ of the original tensor image depicted in Fig. 2. Right column, from top to
bottom: Same with model 2.



model 2 always on the right side. All images are generated using a disk-shaped
stencil of radius $\sqrt{5}$. As mentioned in Section 3, the simplified algorithm has been
used for model 1. Figure 2 shows the results of the erosion and dilation filter
on tensor-valued data for both models. Corresponding filters give very similar
results. The main difference, as mentioned before, is the tendency of model 1 to
reduce direction information faster than model 2 does (see also Figure 4).

This results in a slightly higher contrast in the images in model 2. A number of 
dark spots that appear in the main diagonal parts of the eroded images indicate 
violations of the positive definiteness condition. Due to measurement errors, 
these are already present in the original data set but are widened by erosion. 
Fig. 4. The distribution of numerical eccentricities $e = \sqrt{1 - \lambda_2^2/\lambda_1^2}$ in the dilated
images from Fig. 2.



The experiments for opening and closing can be seen in Figure 3. They
confirm the previous impression: There is a high similarity between the test
results from model 1 and model 2, the main difference being in the off-diagonal
channels, where the higher contrast of model 2 is noticeable again.

The main goal, to create a filter for tensor-valued erosion and dilation (and
the derived opening and closing) which is similar to the scalar case, has been
achieved by both models. Whereas model 2 shows somewhat better results in
the experiments, model 1 has the advantage of being simpler to implement by
using the method based on the two largest eigenvalues.

6 Conclusions 

In this paper we have extended fundamental concepts of mathematical mor- 
phology to the case of matrix-valued data. Based on two alternative approaches, 
definitions for supremum and infimum of a set of positive definite symmetric ma- 
trices were given. One set of definitions relies on the property of scalar-valued 
$p$-means that they tend to the maximum and minimum of their argument sets
for $p \to \pm\infty$; supremum and infimum of matrix sets are constructed by an
analogous limiting procedure. The second approach combines geometrical and 
analytical tools to construct suprema and infima as minimal and maximal ele- 
ments of sets of upper resp. lower bounds of the given matrix set. Each of the two 
approaches enables the generalisation of morphological dilation, erosion and the 
further operations composed from these, like opening and closing. In the exper- 
imental part, we have implemented the different concepts and evaluated them 
on diffusion tensor data. Our future investigation will include a more detailed 
study of the morphological framework built on these operations. 
Abstract. Configurations of dynamic points viewed by one or more
cameras have not been studied much. In this paper, we present sev- 
eral view and time-independent constraints on different configurations 
of points moving on a plane. We show that 4 points with constant inde- 
pendent velocities or accelerations under affine projection can be charac- 
terized in a view independent manner using 2 views. Under perspective 
projection, 5 coplanar points under uniform linear velocity observed for 
3 time instants in a single view have a view-independent characteriza- 
tion. The best known constraint for this case involves 6 points observed 
for 35 frames. Under uniform acceleration, 5 points in 5 time instants 
have a view-independent characterization. We also present constraints 
on a point undergoing arbitrary planar motion under affine projections 
in the Fourier domain. The constraints introduced in this paper involve 
fewer points or views than similar results reported in the literature and 
are simpler to compute in most cases. The constraints developed can 
be applied to many aspects of computer vision. Recognition constraints 
for several planar point configurations of moving points can result from 
them. We also show how time-alignment of views captured independently 
can follow from the constraints on moving point configurations. 



1 Introduction 

The study of view-independent constraints on the projections of a configuration 
of points is important for recognition of such point configurations. A number 
of view-independent invariants have been identified for static point configura- 
tions [1,2]. They encapsulate information about the scene independent of the 
cameras being used and are opposite in philosophy to the scene-independent 
constraints like the Fundamental Matrix [1], the multilinear tensors [3,4,5], etc. 
Formulating view independent constraints on the projections of dynamic point 
configurations is more challenging and has been studied less. Many configura- 
tions of dynamic points are possible. Points could be in general positions or 
could lie on a plane or on a line. The motion could be arbitrary or constrained. 
An interesting case is linearly moving points with independent uniform veloci- 
ties or accelerations. In this paper, we derive several simple constraints on the 
projections of moving points and their motion parameters. 

* Currently with the Department of Computer Science, Columbia University 
As an example of view-independent constraints on point configurations, let 
us consider a set of 5 world points $P[i] \in \mathbb{R}^3$, $i = 1..5$, and their images $p[i]$ 
in homogeneous ($x[i]$ and $y[i]$ in Cartesian) coordinates viewed by an affine 
camera $M$. Let $\mathbf{m}_i$ be the vector of the first 3 elements in the $i$th row of $M$ 
and let $m_{i4}$ be the fourth element in the $i$th row. Therefore, 
$x[i] = \mathbf{m}_1 \cdot P[i] + m_{14}$ and $y[i] = \mathbf{m}_2 \cdot P[i] + m_{24}$. Alternatively, 

\[ [P[i]^T \;\; 1 \;\; x[i]] \, [\mathbf{m}_1 \;\; m_{14} \;\; -1]^T = [P[i]^T \;\; 1 \;\; y[i]] \, [\mathbf{m}_2 \;\; m_{24} \;\; -1]^T = 0. \]

If we have at least five points, then we can form a set of equations of the form 
$C_1 \theta_1 = C_2 \theta_2 = 0$, where $\theta_1 = [\mathbf{m}_1 \;\; m_{14} \;\; -1]^T$, 
$\theta_2 = [\mathbf{m}_2 \;\; m_{24} \;\; -1]^T$ and each row of the measurement matrix $C_1$ (or $C_2$) consists 
of the unknown world point $P[i]$, unity and the $x[i]$ (or $y[i]$) coordinate. Note 
that the camera parameters are factored out into the vectors $\theta_1$ and $\theta_2$. It is obvious 
that $C_1$ and $C_2$ are rank deficient and expanding their $5 \times 5$ determinant results 
in constraints of the form $\sum_{i=1}^{5} \alpha_i x[i] = 0$ and $\sum_{i=1}^{5} \alpha_i y[i] = 0$, where the $\alpha_i$ are 
functions of the world positions of the points $P[i]$ and hence are the same for all 
views, i.e., the $\alpha_i$ are view-independent coefficients. Note that the coefficients of 
$x[i]$ and $y[i]$ in the above constraints are the same $\alpha_i$. Thus, the total number of 
unknowns is 4 (up to scale). Each view gives two equations in terms of $\alpha$. Therefore, 
we need two views of the five points to compute all the view-independent 
coefficients. 
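
The five-point example above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names, the random test data, and the use of an SVD null vector to obtain the coefficients are all assumptions made for illustration. It estimates the coefficients from two affine views and evaluates the constraint in a third view that was not used for estimation.

```python
import numpy as np

def coeffs_from_world(world_pts):
    """alpha with sum_i alpha_i [P[i]^T 1] = 0 for five 3D points (5x3 array)."""
    G = np.hstack([world_pts, np.ones((5, 1))])   # 5x4, rows [P[i]^T 1]
    _, _, vt = np.linalg.svd(G.T)                 # G.T is 4x5, so vt is 5x5
    return vt[-1]                                 # unit null vector of G.T

def coeffs_from_two_views(img_a, img_b):
    """alpha estimated from image coordinates (5x2 arrays) in two affine views."""
    A = np.vstack([img_a.T, img_b.T])             # 4x5: x and y rows of both views
    _, _, vt = np.linalg.svd(A)
    return vt[-1]

def affine_view(world_pts, M):
    """Project with a 2x4 affine camera: [x, y] = M[:, :3] P + M[:, 3]."""
    return world_pts @ M[:, :3].T + M[:, 3]

rng = np.random.default_rng(0)
P = rng.normal(size=(5, 3))                       # hypothetical world points
views = [affine_view(P, rng.normal(size=(2, 4))) for _ in range(3)]

alpha_world = coeffs_from_world(P)
alpha_views = coeffs_from_two_views(views[0], views[1])
# sum_i alpha_i x[i] and sum_i alpha_i y[i] in a view not used for estimation
residual_third_view = alpha_views @ views[2]
```

Under these assumptions the two estimates of the coefficient vector agree up to sign, and the residual in the third view vanishes up to rounding error.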

When the points are not in general position, the rank of C would be less 
than 4 giving rise to simple algebraic constraints. A configuration of four points 
on a plane yields a view-independent constraint defined over two views. Three 
points on a line yield a view-independent constraint that can be computed from a 
single view itself. For linear motion, we can arrive at view-independent algebraic 
constraints by factoring out the camera parameters. 

We derive several view-independent constraints on the projections of a dy- 
namic scene in this paper. They are independent of the camera parameters. Some 
of these constraints are time-dependent while others are time-independent. The 
computational requirements of these constraints depend on the configuration 
and on its dependence on time. We also derive constraints on points with arbi- 
trary planar motion under affine projection. These are computed from a Fourier 
domain representation of the trajectory. The constraints derived here find appli- 
cations in recognition of dynamic point configurations in multiple views, time- 
alignment between views, etc. 



2 Points with Linear Motion 

We first consider the case of uniform linear motion. When a point moves in the 
world with uniform linear velocity or acceleration, its projections in various views 
move in a parameterizable manner. The view-independent relationships between 
projections of points moving with uniform velocity presented recently [6] fall 
under this category. The two view constraints on points moving with uniform 
velocity [7] is another contribution in this direction. In this section, we study 
the projection of points moving in a linear fashion imaged under affine and 
projective camera models. Let P be a 3D world point, moving with uniform 
linear polynomial motion. Its position at any time instant $t$ is given by 

\[ P_t = \begin{bmatrix} I \\ 1 \end{bmatrix} + \begin{bmatrix} Q_1 \\ 0 \end{bmatrix} t + \begin{bmatrix} Q_2 \\ 0 \end{bmatrix} t^2 + \cdots + \begin{bmatrix} Q_n \\ 0 \end{bmatrix} t^n \tag{1} \]

where $I$ is the initial position and the $Q_i$ are 3-vectors. Let $\mathbf{p}^l_t = [x^l_t \;\; y^l_t]^T$ be the 
projection of $P$ in view $l$ at time $t$ due to a camera characterized by the camera 
matrix $M^l$. 



2.1 Uniform Linear Motion under Affine Projection 

When the camera is affine, we can differentiate the projection $\mathbf{p}^l_t$ with respect 
to $t$ to get the velocity 

\[ \mathbf{v}^l_t = M^l \sum_{i=1}^{n} i \, t^{i-1} \begin{bmatrix} Q_i \\ 0 \end{bmatrix} \tag{2} \]




If the point moves with uniform velocity $U$ in the world, the image velocity can 
be written as $\mathbf{v}^l = M^l [U \;\; 0]^T$. Thus, the projected point moves with uniform 
velocity that is the projection of the world velocity. If the point moves with 
uniform linear acceleration, its image velocity is given by 
$\mathbf{v}^l = M^l([U \;\; 0]^T + [A \;\; 0]^T t)$ and its image acceleration is given by 
$\mathbf{a}^l = M^l [A \;\; 0]^T$, where $A$ is the world 
acceleration of the point. This implies that the projection of a point moving with 
uniform linear acceleration in the world has uniform linear acceleration [8]. Such 
simple parameterization is not available for the general projective camera. 
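
As a small numerical illustration of the statement above, the sketch below projects a uniformly accelerating point with an affine camera and measures the image acceleration by second differences. The camera matrix, the motion parameters, and the convention $P_t = I + Ut + \frac{1}{2}At^2$ (so that $U = Q_1$ and $A = 2Q_2$ in the notation of Equation 1) are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical 2x4 affine camera and motion parameters (illustrative values).
M = np.array([[0.9, 0.1, -0.2, 5.0],
              [0.2, 1.1, 0.3, -2.0]])
I0 = np.array([1.0, 2.0, 3.0])     # initial world position
U = np.array([0.5, -0.3, 0.2])     # world velocity
A = np.array([0.1, 0.4, -0.2])     # world acceleration

def project(t):
    """Affine projection of the world point at time t."""
    P_t = I0 + U * t + 0.5 * A * t**2
    return M[:, :3] @ P_t + M[:, 3]

ts = np.arange(6.0)                           # unit time steps 0, 1, ..., 5
p = np.array([project(t) for t in ts])
image_acc = p[2:] - 2 * p[1:-1] + p[:-2]      # second differences, one row per step
expected = M[:, :3] @ A                       # projection of the world acceleration
```

Every row of `image_acc` equals `expected`, i.e. the image acceleration is constant and is the affine projection of the world acceleration.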



2.2 Uniform Linear Motion under Perspective Projection 

We consider the image motion of points undergoing uniform linear motion in the 
world. Since the point $P_t$ projects to $\mathbf{p}^l_t = M^l P_t$, we can write $x_t$ and $y_t$ as 

\[ x_t = \frac{\sum_{i=0}^{n} \psi_i t^i}{\sum_{i=0}^{n} \chi_i t^i} \quad \text{and} \quad y_t = \frac{\sum_{i=0}^{n} \phi_i t^i}{\sum_{i=0}^{n} \chi_i t^i} \tag{3} \]

where $\psi_i$, $\phi_i$, and $\chi_i$ are functions of $I$, $Q_i$, and $M^l$ and hence constant for a 
point in a particular view. We can parameterize the projection of the point at 
time $t$ with $3n + 2$ unknowns up to scale. These parameters can be computed 
from $\lceil 1.5n + 1 \rceil$ time instants since each time instant provides two equations. 

We can parameterize the moving point as the intersection of the line of 
motion of the projection and lines perpendicular to it at various time instants. 
If $(b, -a, d)$ is the line of motion of the projection in the image over time, the 
line perpendicular to it can be written as $\mathbf{l}(t) = (a, b, c(t))$. Since $a$ and $b$ are 
constants, only $c(t)$ (a measure of the distance of the line from the origin) varies 
with time. Only two of the three parameters $a, b, d$ are independent as a line is 
defined up to scale. Since $\mathbf{p}_t$ lies on $\mathbf{l}(t)$, we have $\mathbf{l}(t)^T \mathbf{p}_t = 0$. Replacing $\mathbf{p}_t$ with 
$M^l P_t$ and expanding, we get 

\[ \sum_{i=0}^{n} \mu_i t^i + c(t) \sum_{i=0}^{n} \eta_i t^i = 0 \tag{4} \]

where $\mu_i$ and $\eta_i$ are functions of $I$, $Q_i$, $M^l$, $a$, and $b$, and are constant for a point 
in a view. The term $c(t)$ can be parameterized using $(2n + 1)$ unknowns up to 
scale. Since the $\mu$'s and $\eta$'s are functions of the $\phi$'s, $\psi$'s and $\chi$'s, no new information 
is gained by this parameterization. However, the time-dependent part of the 
motion can be parameterized using fewer parameters, by factoring the 
time-independent parts out. The point at time $t$ can be obtained by taking the cross 
product of the lines $(b, -a, d)$ and $(a, b, c(t))$. This representation of the position 
of the projection has fewer essential unknowns than the parameterization of 
Equation 3. 

Uniform Linear Velocity: If $n = 1$, Equation 4 becomes $\mu_0 + \mu_1 t + c(t)(\eta_0 + t) = 0$ 
with $\eta_1 = 1$. The parameterization will have 3 unknowns up to scale and 
needs 3 time instants to compute them. The parameters have been partitioned 
into time-dependent and time-independent parts. The line of motion $(b, -a, d)$ 
(2 unknowns up to scale) can be computed from projections at any two time 
instants. Together, the time-dependent and time-independent aspects make up 
the 5 degrees of freedom associated with the system. 
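
The uniform-velocity parameterization can be sketched in code as follows. This is an illustrative example under stated assumptions, not the authors' implementation: the camera matrix, the world motion, and the helper names `line_of_motion`, `fit_c` and `predict` are hypothetical. It recovers the line of motion from two frames, solves $\mu_0 + \mu_1 t + c(t)(\eta_0 + t) = 0$ from three time instants, and predicts the projection at a later time as the intersection of the two lines.

```python
import numpy as np

# Hypothetical 3x4 projective camera and uniform linear world motion.
M = np.array([[1.0, 0.0, 0.2, 0.1],
              [0.0, 1.0, -0.1, 0.3],
              [0.1, 0.05, 1.0, 2.0]])
I0 = np.array([0.2, -0.1, 1.0])
U = np.array([0.3, 0.1, 0.5])

def project(t):
    h = M @ np.append(I0 + U * t, 1.0)
    return h[:2] / h[2]

def line_of_motion(p0, p1):
    """Return (a, b, d) such that the image line of motion is (b, -a, d)."""
    l = np.cross(np.append(p0, 1.0), np.append(p1, 1.0))
    return -l[1], l[0], l[2]

def fit_c(times, pts, a, b):
    """Solve mu0 + mu1*t + c(t)*(eta0 + t) = 0 for (mu0, mu1, eta0)."""
    c = -(pts @ np.array([a, b]))          # (a, b, c(t)) passes through p_t
    lhs = np.column_stack([np.ones_like(times), times, c])
    return np.linalg.solve(lhs, -c * times)

def predict(t, a, b, d, params):
    mu0, mu1, eta0 = params
    c_t = -(mu0 + mu1 * t) / (eta0 + t)
    h = np.cross([b, -a, d], [a, b, c_t])  # intersection of the two lines
    return h[:2] / h[2]

times = np.array([0.0, 1.0, 2.0])          # three time instants
pts = np.array([project(t) for t in times])
a, b, d = line_of_motion(pts[0], pts[1])
params = fit_c(times, pts, a, b)
predicted = predict(5.0, a, b, d, params)
```

With noise-free data, `predicted` coincides with `project(5.0)`; with noisy measurements a least-squares fit over more than three time instants would be the natural extension.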

Uniform Linear Acceleration: The simple parameterization gives $x_t$ and $y_t$ as 
ratios of two polynomials in $t$ of degree 2, with 8 unknowns, and needs 
measurements at 4 time instants to compute them. The new parameterization has 
only 5 unknowns and can be determined from 5 time instants. The polynomial 
constraint is given by $\mu_0 + \mu_1 t + \mu_2 t^2 + c(t)(\eta_0 + \eta_1 t + t^2) = 0$ with $\eta_2 = 1$. 

2.3 General Linear Motion 

Under general linear motion, the trajectories of the points will be straight lines 
and constraints on matching lines in multiple views are satisfied by each moving 
point independently. If a world line is imaged by projective cameras as $\mathbf{l}^1$, $\mathbf{l}^2$, 
and $\mathbf{l}^3$, the projections are related by a trilinear constraint [3,4,9] as 

\[ l^1_i = l^2_j \, l^3_k \, T_i^{jk} \tag{5} \]

where T is a suitable tensor. This gives a constraint on the trajectories of points 
undergoing general linear motion. Nothing more can be said about them since 
no more information is available other than the linearity of their trajectories. 

3 Motion Analysis in Fourier Domain 

If we have a number of moving points, their collective properties can be exploited 
in addition to the motion constraints. Properties of collections can be captured 
in the Fourier domain. We consider a configuration of a large number of points 
moving with independent uniform linear velocities in this section. We also explore 
Fourier domain representation of a point undergoing arbitrary co-planar motion. 

3.1 Multiple Linearly Moving Points 

Recognition of deformable shapes has been studied and applied to tracking 
of non-rigid objects when the deformation between two consecutive frames is 
small [10,11], in the context of handwriting recognition [12,13], and for contour 
extraction and modeling [14]. Some approaches suggest learning a deformable 
model from examples, while some use deformable templates and ascertain a 
match by determining how much a template has to be deformed to get the test 
shape. These techniques do not assume any specific structure in the deforma- 
tion. Our work on the other hand attempts to develop a sound theoretical model 
when the deformation has a particular structure. 

Let $P[i]$ be the sequence of $N$ points moving with independent uniform linear 
velocities $V[i]$, like points on the envelope of an evolving planar boundary. Let 
the projection of $P[i]$ in view $l$ at time $t$ be $\mathbf{p}^l_t[i]$. A homography maps points in 
one view to points in the other [1]. If the homography is affine, 

\[ \mathbf{p}^l[i] = A^l \mathbf{p}^0[i] + \mathbf{b}^l, \quad 0 \le i < N \tag{6} \]

where $A^l$ is the upper $2 \times 2$ minor of the homography and $\mathbf{b}^l$ is taken from 
its third column. An unknown shift $\lambda_l$ aligns the points between views 0 and 
$l$. Taking the Fourier transform of Equation 6 and ignoring the frequency term 
corresponding to $k = 0$, we get 

\[ \mathbf{P}^l[k] = A^l \mathbf{P}^0[k] \, e^{j 2 \pi \lambda_l k / N}, \quad 0 < k < N \tag{7} \]

where $\mathbf{P}^l = [X^l \;\; Y^l]^T$; $X^l$ and $Y^l$ are the Fourier transforms of the sequences 
$x^l$ and $y^l$ respectively. A point moving with uniform velocity in the world moves 
with uniform velocity in an affine view (Section 2.1). The projection at any time 
$t$ is given by $\mathbf{p}^l_t[i] = \mathbf{p}^l_0[i] + \mathbf{v}^l[i] \, t$, $0 \le i < N$, where $\mathbf{v}^l[i] = [v^l_x[i] \;\; v^l_y[i]]^T$ is the 
velocity vector in the image. Taking the Fourier transform of both sides, we get 

\[ \mathbf{P}^l_t[k] = \mathbf{P}^l_0[k] + \mathbf{V}^l[k] \, t \tag{8} \]

where $\mathbf{V}^l$ is the Fourier transform of the sequence $\mathbf{v}^l$. We define a sequence 
measure $\kappa$ on $\mathbf{P}^l_t$ as 

\[ \kappa^l_t[k] = \mathbf{P}^l_t[k]^{*T} \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \mathbf{P}^l_t[k], \quad 0 < k < N \tag{9} \]

Using Equations 7 and 8, it can be shown that 

\[ \kappa^l_t[k] = |A^l| \left( a_1[k] + a_2[k] \, t + a_3[k] \, t^2 \right) \tag{10} \]

where the $a$'s are functions of measurements $\mathbf{p}^0_0$ and $\mathbf{v}^0$ (or their Fourier domain 
representation) made only in the reference view. The $\kappa$ sequence and hence the $a$'s 
are pure imaginary and can be computed by observing 2 frames in the reference 
view to determine position ($\mathbf{p}^0_0$) values and velocity ($\mathbf{v}^0$) values. No time 
synchronization or point correspondence is required between views as the shift term 
$\lambda_l$ gets eliminated in $\kappa$. In the reference view, $\kappa^0_t[k] = a_1[k] + a_2[k] \, t + a_3[k] \, t^2$. 
The sequence measure $\kappa^l_t[k]$ is thus view-independent but time-dependent. 
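
The behaviour of the sequence measure can be illustrated with a short numerical sketch. It assumes that the measure is the antisymmetric form $X^*[k] Y[k] - Y^*[k] X[k]$ (the form that is pure imaginary and scales with the determinant of the affine map, as the text states), uses NumPy's FFT sign convention, and models the unknown shift as a cyclic shift of the point ordering. All data and names are hypothetical.

```python
import numpy as np

def fourier(seq):
    """DFT of an N x 2 point sequence, with the k = 0 term dropped."""
    return np.fft.fft(seq, axis=0)[1:]

def cross_measure(P, Q):
    """conj(P[k])^T [[0, 1], [-1, 0]] Q[k] for every retained frequency k."""
    return np.conj(P[:, 0]) * Q[:, 1] - np.conj(P[:, 1]) * Q[:, 0]

def kappa(seq):
    F = fourier(seq)
    return cross_measure(F, F)

rng = np.random.default_rng(1)
N = 32
p0 = rng.normal(size=(N, 2))               # reference-view positions at t = 0
v0 = rng.normal(size=(N, 2))               # reference-view image velocities

# Coefficient sequences of the quadratic in t, from the reference view only.
P0, V0 = fourier(p0), fourier(v0)
a1 = cross_measure(P0, P0)
a2 = cross_measure(P0, V0) + cross_measure(V0, P0)
a3 = cross_measure(V0, V0)

# A second view: affine map, unknown cyclic shift, arbitrary time instant.
A = np.array([[1.2, 0.3], [-0.4, 0.8]])
b = np.array([3.0, -1.0])
shift, t = 5, 2.5
view = np.roll(p0 + v0 * t, -shift, axis=0) @ A.T + b

kappa_view = kappa(view)
kappa_model = np.linalg.det(A) * (a1 + a2 * t + a3 * t**2)
```

Here `kappa_view` matches `kappa_model` even though neither the shift nor the translation `b` is known to the second computation, which is the elimination property described above.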
3.2 Arbitrary Motion of a Point 

Are there any view-independent constraints on a point undergoing arbitrary pla- 
nar motion? The image of the point in any view will trace out a contour (closed 
or open) over time, which is the projection of its world trajectory. The problem 
reduces to the analysis of planar contours and view-independent constraints for 
planar contours will be applicable to the moving point. We now present the con- 
tour constraints presented in [15] to characterize arbitrary planar motion of a 
point under affine imaging conditions. 

Let $P[i]$ be the sequence of $N$ points on the closed planar trajectory of a point 
and let $(x^l[i], y^l[i])$ be its images in view $l$. (The index $i$ is a measure of time in 
this case as the point is at different locations at different times.) Assuming that 
the views are related by an affine homography, the points on the contour in view 
$l$ are related to corresponding points on the contour in the reference view 0 as 

\[ \mathbf{p}^l[i] = A^l \mathbf{p}^0[i] + \mathbf{b}^l, \quad 0 \le i < N \tag{11} \]

where $A^l$ and $\mathbf{b}^l$ are as in Equation 6. The time alignment information across 
views is not typically available. Taking the Fourier transform of Equation 11 and 
discarding the DC term, we get $\mathbf{P}^l[k] = A^l \mathbf{P}^0[k] \, e^{j 2 \pi \lambda_l k / N}$, $0 < k < N$, where $\lambda_l$ 
is the time alignment parameter and $\mathbf{P}^l$ the Fourier transform as in Equation 7. 
We can define a time-independent sequence measure $\kappa^l$ similar to the one given 
in Equation 9. We can easily see 

\[ \kappa^l[k] = |A^l| \, \kappa^0[k]. \tag{12} \]

Thus, $\kappa[k]$ is a relative view-invariant sequence for the point having arbitrary 
motion. It can be computed in any view by tracking the point over time to 
construct the contour $\mathbf{p}[i]$. 

4 Applications of View-Independent Constraints 

We describe how the parameterizations and constraints developed in this paper 
can be applied to the problems of recognition and time-alignment. 



4.1 Configuration of 4 Points under Affine Projection 

In Section 2.1 we had parameterized the velocity and acceleration of the projec- 
tion of a point moving with uniform linear polynomial motion. We now use those 
parameterizations to derive view independent constraints on configurations of 4 
points moving with independent uniform linear motion parameters. 

Equation 2 can be written as $v_x(t) = \mathbf{m}_1 \sum_{i=1}^{n} i \, t^{i-1} Q_i$ and 
$v_y(t) = \mathbf{m}_2 \sum_{i=1}^{n} i \, t^{i-1} Q_i$. Rearranging terms, we get 

\[ \left[ \Big( \sum_{i=1}^{n} i \, t^{i-1} Q_i \Big)^T \;\; v_x(t) \right] [\mathbf{m}_1 \;\; -1]^T = \left[ \Big( \sum_{i=1}^{n} i \, t^{i-1} Q_i \Big)^T \;\; v_y(t) \right] [\mathbf{m}_2 \;\; -1]^T = 0 \]

where $[\mathbf{m}_i \;\; m_{i4}]$ is the $i$th row of $M^l$, and $v_x$ and $v_y$ are the $x$ and $y$ components 
of the point's velocity in view $l$. If we have at least four points, then we can form 
a set of equations of the form $C\theta = 0$, where each row of the measurement matrix 
$C$ consists of the unknown world point motion parameters $Q_i$, and the velocities 
along the $x$ or $y$ coordinate. $C$ is a rank deficient matrix with a maximum rank of 
3. Equating its $4 \times 4$ determinant to 0 results in the following linear constraints. 

\[ C_0 v^l_{1x} + C_1 v^l_{2x} + C_2 v^l_{3x} + C_3 v^l_{4x} = C_0 v^l_{1y} + C_1 v^l_{2y} + C_2 v^l_{3y} + C_3 v^l_{4y} = 0 \tag{13} \]

where each $C_i$ is a polynomial of order $3(n-1)$ in the time parameter $t$. The $C_i$'s are 
view-independent as each has $3n - 2$ terms that are functions of the world motion 
parameters. The total number of view-independent parameters is $4(3n - 2) - 1 = 
12n - 9$ up to scale. Each time instant provides 2 equations in the unknowns; 
we need $(6n - 4)$ measurements of the velocities in one or more frames in one or 
more views to compute the $C_i$ values. 

Uniform Linear Velocity: When the points move with independent uniform linear 
velocities, n = 1 and we get linear view and time independent constraints on 
the velocities of the projections. These constraints have 3 view independent 
coefficients, computing which needs the measurement of the velocities of the 
four points in 2 views. 
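
A minimal numerical sketch of the uniform-velocity case follows. The random world data, the cameras, and the SVD-based estimation are assumptions made for illustration only. Image velocities are measured from two frames per view, the three coefficients (up to scale) are estimated from two affine views, and the constraint of Equation 13 is then evaluated in a third view.

```python
import numpy as np

rng = np.random.default_rng(2)
P0 = rng.normal(size=(4, 3))                 # hypothetical initial world positions
U = rng.normal(size=(4, 3))                  # independent world velocities

def image_velocities(M, t0=0.0, t1=1.0):
    """Velocities of the 4 projections in a 2x4 affine view, from two frames."""
    def frame(t):
        return (P0 + U * t) @ M[:, :3].T + M[:, 3]
    return (frame(t1) - frame(t0)) / (t1 - t0)   # 4x2: columns v_x, v_y

cams = [rng.normal(size=(2, 4)) for _ in range(3)]
v = [image_velocities(M) for M in cams]

# Two views give four equations sum_i C_i v_i = 0 in C (3 unknowns up to scale).
stacked = np.vstack([v[0].T, v[1].T])        # 4x4 matrix of rank 3
_, s, vt = np.linalg.svd(stacked)
C = vt[-1]                                   # view- and time-independent coefficients
residual = C @ v[2]                          # constraint evaluated in an unseen view
```

Because the image velocities are constant over time, the frames used to measure them need not be the same in the different views.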

These results are better than the Recognition Polynomials and Shape Ten- 
sors presented earlier [7,6]. A view independent representation of a configura- 
tion of stationary points could be constructed from 2 views of 4 points under 
orthographic projections [16]. This was extended to recognize human gait using 
2 views of 5 points under scaled-orthographic projections [7]. Time-dependent 
constraints involving a single view of 5 points with uniform velocity is presented 
in [6] for affine projection. Our results yield view and time independent con- 
straints involving 4 points in 2 views under general affine projection - which is 
a significant improvement. 

Uniform Linear Acceleration: When the points in the configuration move with in- 
dependent uniform linear accelerations, n = 2, giving us linear, view-independent, 
time-dependent constraints on the velocities of the projections. These constraints 
have 15 view independent coefficients computing which needs measuring the ve- 
locities of the four points at a total of 8 time instants in one or more views. 

Proceeding in a similar manner and factoring out the camera parameters as 
above, we can formulate linear time and view independent constraints on the 
accelerations of the projections, which have the same form and computational 
requirements as the constraints on the velocities of the projections of points 
moving with uniform linear velocities. 

4.2 Configurations of 5 Points under Projective Cameras 

Invariants provide us with the ability to come up with representations of the 
features in a scene that do not depend on the view, and can prove to be ex- 
tremely handy when processing information from multiple views. For instance, 
to recognize a configuration of five coplanar points from any view of the same, we 
can compute the cross ratio of areas of the projections of the five points, which 
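would be the same no matter which view we compute it in [2]. The cross-ratio 
of the areas of five points $\mathbf{x}[1]$, $\mathbf{x}[2]$, $\mathbf{x}[3]$, $\mathbf{x}[4]$, and $\mathbf{x}[5]$, no three of which are 
collinear, is defined as 

\[ cr(\mathbf{x}[1], \mathbf{x}[2], \mathbf{x}[3], \mathbf{x}[4], \mathbf{x}[5]) = \frac{\Delta_{\mathbf{x}[1]\mathbf{x}[2]\mathbf{x}[5]} \cdot \Delta_{\mathbf{x}[3]\mathbf{x}[4]\mathbf{x}[5]}}{\Delta_{\mathbf{x}[1]\mathbf{x}[3]\mathbf{x}[5]} \cdot \Delta_{\mathbf{x}[2]\mathbf{x}[4]\mathbf{x}[5]}} \tag{14} \]

where $\Delta_{\mathbf{x}[i]\mathbf{x}[j]\mathbf{x}[k]}$ is the area of the triangle formed by points $\mathbf{x}[i]$, $\mathbf{x}[j]$, $\mathbf{x}[k]$. 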
This is for a static configuration of points or for snapshots of the scene taken at 
the same time. We now extend this to dynamic scenes where points move with 
uniform linear velocities or accelerations to arrive at time varying invariants for 
such configurations. Due to the novel parameterization for projective cameras 
described in Section 2.2, the number of unknowns needed to compute 
the time-varying invariants is smaller than with a naive parameterization 
approach. 
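
The invariance of the cross ratio of areas in Equation 14 under a planar homography can be checked with the short sketch below. The point coordinates, velocities, and homography are hypothetical values chosen so that no three points are collinear; signed areas are used, which leaves the ratio unchanged.

```python
import numpy as np

def area(p, q, r):
    """Twice the signed area of triangle pqr (2D points)."""
    return np.linalg.det(np.array([[p[0], p[1], 1.0],
                                   [q[0], q[1], 1.0],
                                   [r[0], r[1], 1.0]]))

def cross_ratio_of_areas(x):
    """Equation 14 for a 5x2 array; x[0]..x[4] are x[1]..x[5] in the text."""
    num = area(x[0], x[1], x[4]) * area(x[2], x[3], x[4])
    den = area(x[0], x[2], x[4]) * area(x[1], x[3], x[4])
    return num / den

def apply_homography(H, x):
    h = np.hstack([x, np.ones((len(x), 1))]) @ H.T
    return h[:, :2] / h[:, 2:3]

# Five coplanar points with independent uniform velocities (illustrative values).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.5], [0.4, 0.3]])
vel = np.array([[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1], [0.05, -0.05], [0.02, 0.03]])
H = np.array([[1.1, 0.2, 0.3], [-0.1, 0.9, 0.5], [0.05, 0.02, 1.0]])

t = 0.7
plane_pts = pts + vel * t                     # configuration at time t
cr_ref = cross_ratio_of_areas(plane_pts)
cr_view = cross_ratio_of_areas(apply_homography(H, plane_pts))
```

The two values agree because the determinant of the homography and the projective scale factors of the five points cancel between numerator and denominator.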

Uniform Linear Velocity: If the points lie on a plane during the motion, the 
various views of the point configuration are related by a projective homogra- 
phy [1]. To express the configuration in a view-independent manner, we use an 
invariant to projective transformations of 2D [2]. Given the projections of a con- 
figuration of five coplanar points, which are in general position in the image, 
i.e., no three are collinear, we can define an invariant like the cross ratio of areas 
(Equation 14). The cross ratio of areas of the parametric representations of the 
projections of five points having independent uniform velocities is the ratio of 
two polynomials of degree 6 in the time parameter t. 

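\[ I^l_v(t) = \frac{N^l_v(t)}{D^l_v(t)} = \frac{\sum_{i=0}^{6} \gamma^l_i t^i}{\sum_{i=0}^{6} \delta^l_i t^i} \]

where $I^l_v(t)$ is the invariant computed in view $l$ at time $t$ and the $\gamma^l$ and $\delta^l$ terms are 
functions of the parameters used to represent the points in view $l$. The number of 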
essential unknowns in this expression is only 15 (3 for each point) and measure- 
ments made in only three time instants in each view are required to determine 
this time varying invariant. This is a significant theoretical advancement over 
the formulation presented in Levin et al. [6] that requires the projections of 6 
points having coplanar independent uniform linear velocities, has 35 unknowns, 
computing which need 34 time instants. 

To recognize a configuration, we need to determine whether the invariants 
computed in all the views are identical or not. For any two views $l$ and $m$, this 
implies that 

\[ I^l_v(t) = I^m_v(t) \;\Rightarrow\; \frac{N^l_v(t)}{D^l_v(t)} = \frac{N^m_v(t)}{D^m_v(t)} \;\Rightarrow\; N^l_v(t) \, D^m_v(t) = N^m_v(t) \, D^l_v(t) \]

Therefore, for a configuration of 5 points moving with uniform linear velocities, 
the ratio of the coefficients of $t^i$ in $N^l_v(t) \, D^m_v(t)$ and $N^m_v(t) \, D^l_v(t)$ should be 1 for 
$0 \le i \le 12$. This necessary constraint for recognition, however, holds only when 
time-alignment across views is known. For recognition, we can also make use of 
the additional necessary constraint that there should exist a unique homography 
that maps the lines of motion of the projection in the test and reference views. 
Uniform Linear Acceleration: When all 5 points of a configuration moving with 
independent linear accelerations always lie on the same plane, we can define 
a time varying invariant for the configuration similar to the one above. The 
time varying invariant obtained on computing the cross ratio of the areas of the 
parametric projections of the configuration is the ratio of two polynomials of 
order 12 in the time parameter $t$ and has only 25 unknowns (5 for each point), 
determining which needs measurements made at 5 time instants: 

\[ I^l_a(t) = \frac{N^l_a(t)}{D^l_a(t)} \]

where $I^l_a(t)$ is the invariant computed in view $l$ at time $t$ and the coefficients of 
$N^l_a(t)$ and $D^l_a(t)$ are functions of the parameters used for representing the points in view $l$. 

As in the case of uniform linear velocity, the value of the invariant computed 
in all the views has to be the same, which implies that, for any two views $l$ and $m$, 
the ratio of the coefficients of $t^i$ in $N^l_a(t) \, D^m_a(t)$ and $N^m_a(t) \, D^l_a(t)$ should be 1 for 
$0 \le i \le 24$. As in the case of uniform velocity, for recognition of the configuration, we can make 
use of the additional necessary constraint that there should exist a homography 
that relates the lines of motion in the two views. 

4.3 Recognition Constraints in Fourier Domain 

In Section 3.1 we had modeled configurations of many points having indepen- 
dent uniform linear velocities and their motion in the Fourier domain. In this 
subsection, we use those models to derive constraints for recognizing such con- 
figurations. 

Configuration at the Same Time in Multiple Views: It has been shown in 
Equation 10 that 

\[ \kappa^l_t[k] = |A^l| \, \kappa^0_t[k] \tag{15} \]

Equation 15 provides a recognition mechanism for such a case. Given $M$ views, 
we can compute an $M \times (N - 1)$ measurement matrix $C_1$ constructed by stacking 
the $\kappa^l_t$ measures for the various views, one row for each view. Since the various 
rows are scaled versions of each other, the rank of $C_1$ would be 1. Therefore a 
necessary algebraic recognition constraint is $\mathrm{rank}(C_1) = 1$. 

Configuration at Different Times in Multiple Views: The problem of recognizing 
the contour when we have its views at different time instants is a more challenging 
problem. Let us assume that in the reference view (0), we are able to track the 
points in two frames (identify points in a view across time) and hence able to 
identify all the $a_i$. Now given the configuration observed in any other view at any 
time $t$, we can recognize it to be the same as the one observed in the reference 
view. Observe that Equation 10 states that $\kappa^l_t$ is a linear combination of the 
vectors $a_i$, the time $t$ being a component of the linear combination coefficients. 
Given $M$ views, we can construct an $(M + 2) \times (N - 1)$ measurement matrix $C_2$ 
whose first three rows contain the vectors $a_i$, $i = 1, 2, 3$. The $\kappa^l_t$ computed in 
the various views (except the reference view) then contribute one row each to 
$C_2$. Note that the time instants at which $\kappa$ is computed in a view need not be 
the same in all views. Since every row constructed from $\kappa^l_t$ can be expressed 
as a linear combination of the first 3 rows, a necessary algebraic recognition 
constraint is $\mathrm{rank}(C_2) = 3$. This technique does not need correspondence across 
views and assumes tracking only in the reference view. 
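
The rank constraint on the measurement matrix can be illustrated as follows. This is a sketch under assumptions: the sequence measure is taken to be the antisymmetric form $X^*[k] Y[k] - Y^*[k] X[k]$, the shift is modelled as a cyclic reordering of the points, and the data, the number of views, and the helper names are hypothetical.

```python
import numpy as np

def fourier(seq):
    return np.fft.fft(seq, axis=0)[1:]            # drop the k = 0 term

def cross_measure(P, Q):
    """conj(P[k])^T [[0, 1], [-1, 0]] Q[k] for every retained frequency k."""
    return np.conj(P[:, 0]) * Q[:, 1] - np.conj(P[:, 1]) * Q[:, 0]

rng = np.random.default_rng(3)
N = 16
p0, v0 = rng.normal(size=(N, 2)), rng.normal(size=(N, 2))

# First three rows: the coefficient sequences from the reference view.
P0, V0 = fourier(p0), fourier(v0)
rows = [cross_measure(P0, P0),
        cross_measure(P0, V0) + cross_measure(V0, P0),
        cross_measure(V0, V0)]

# Four other views, each at its own time, affine map, and unknown shift.
for _ in range(4):
    A, b = rng.normal(size=(2, 2)), rng.normal(size=2)
    t, shift = rng.uniform(0.0, 5.0), int(rng.integers(N))
    view = np.roll(p0 + v0 * t, -shift, axis=0) @ A.T + b
    F = fourier(view)
    rows.append(cross_measure(F, F))

C2 = np.vstack(rows)                              # (M + 2) x (N - 1) with M = 5
s = np.linalg.svd(C2, compute_uv=False)           # singular values, descending
```

Only the first three singular values in `s` are significantly different from zero, so a numerical rank test with a suitable threshold recovers the rank of 3; with noisy tracks the threshold would have to be chosen relative to the noise level.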

Recognizing Arbitrary Point Motion: In Section 3.2, we modeled the motion of a 
point moving on a closed arbitrary planar trajectory in the world as a contour 
and mapped the problem of its analysis to contour analysis. We evaluate the $\kappa$ 
measure for the Fourier domain representation of the contour in view $l$. It can 
be shown that [15] 

\[ \kappa^l[k] = |A^l| \, \kappa^0[k], \quad 0 < k < N. \tag{16} \]

The $\kappa$ values can be computed independently for each view from the Fourier 
domain. The $\kappa$ sequence is invariant up to scale and can recognize the contour 
formed by the motion. Given $M$ views of the motion, we can construct an $M \times 
(N - 1)$ matrix $C_3$, the $i$th row of $C_3$ consisting of the $\kappa$ values computed in the 
$i$th view. It can be seen from Equation 16 that the rank of $C_3$ is 1. This constraint is 
view-independent as the $\kappa$ can be computed independently in each view. There 
are no restrictions on the number of frames in which the motion is observed. 
In practice the Fourier transform will be reliable only if the curve has sufficient 
length. If a number of points can be tracked independently, each contour will 
yield a different constraint, all of which have to be satisfied simultaneously. The 
above result hints that there can exist a number of algebraic constraints on the 
trajectory traced out by the projections of a moving point in a view. 



4.4 Time Alignment 

The recognition constraints presented here do not need time alignment infor- 
mation across views. We can determine time alignment using these constraints 
as we show next. This time alignment can then be used to align frames of syn- 
chronized videos captured from multiple viewpoints. We consider the problem 
wherein we have to time align two image sequences A and B of the same world 
motion. To do this, we need to determine the shift $\lambda$ that when applied to B 
would ensure that the $k$th image in each sequence is a snapshot of the world at 
the $k$th time instant. 

Point Configurations of 5 Points: We can use the invariants described in Sec- 
tion 4.2 to time align views A and B of a configuration of 5 points moving with 
independent uniform linear velocities or accelerations. Let time t in view A be 
the time instant with reference to which we want to align view B. In view A, we 
compute the value of the invariant for the point configuration at time t and in 
view B, we compute the parameters of the time varying invariant (for uniform 
velocity or acceleration as the case may be) . We then perform a search over the 
range of possible values of $\lambda$ seeking that shift at which the invariants computed 
at times $t$ in A and $(t + \lambda)$ in B are identical. 

Point Configurations of Many Points: The techniques for recognizing a deforming 
contour presented in Section 4.3 do not depend on the time instant at which the 
$\kappa$ values are computed in a view. In fact, they can be used to determine the time 
parameter. Let $\kappa^B_\tau$ be the $\kappa$ sequence computed at time $\tau$ in view B. Normalizing 
$\kappa^B_\tau[k]$ (Equation 10) with respect to a fixed frequency (say $p$) gives 

\[ \frac{\kappa^B_\tau[k]}{\kappa^B_\tau[p]} = \frac{a_1[k] + a_2[k] \, t + a_3[k] \, t^2}{a_1[p] + a_2[p] \, t + a_3[p] \, t^2} \tag{17} \]

The $a$'s can be computed in the reference view A, if we are able to track points 
in it for at least two frames. Equation 17 is a quadratic in time $t$, solving for 
which, we can find the time instant (frame number) in A corresponding to the 
time instant $\tau$ in B. The value of $\lambda$ in this case is given by $(\tau - t)$. 
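
The following sketch shows how the quadratic of Equation 17 might be solved numerically. It is illustrative only: it assumes the antisymmetric form $X^*[k] Y[k] - Y^*[k] X[k]$ for the sequence measure, uses a cyclic shift and an affine map chosen arbitrarily, and picks two retained frequencies by array index. One of the two roots is the sought time; the spurious root can be discarded by repeating the computation at another frequency.

```python
import numpy as np

def fourier(seq):
    return np.fft.fft(seq, axis=0)[1:]            # drop the k = 0 term

def cross_measure(P, Q):
    return np.conj(P[:, 0]) * Q[:, 1] - np.conj(P[:, 1]) * Q[:, 0]

rng = np.random.default_rng(4)
N = 16
p0, v0 = rng.normal(size=(N, 2)), rng.normal(size=(N, 2))

# Coefficient sequences from two tracked frames of the reference view A.
P0, V0 = fourier(p0), fourier(v0)
a1 = cross_measure(P0, P0)
a2 = cross_measure(P0, V0) + cross_measure(V0, P0)
a3 = cross_measure(V0, V0)

# View B observes the configuration at an unknown time (hypothetical values).
A = np.array([[0.9, -0.2], [0.4, 1.1]])
b = np.array([2.0, 1.0])
t_true = 3.0
view_b = np.roll(p0 + v0 * t_true, -3, axis=0) @ A.T + b
FB = fourier(view_b)
kappa_b = cross_measure(FB, FB)

k, p = 1, 0                                       # array indices of two frequencies
r = (kappa_b[k] / kappa_b[p]).real                # ratio of imaginary numbers is real
# r*(a1[p] + a2[p] t + a3[p] t^2) - (a1[k] + a2[k] t + a3[k] t^2) = 0
coeffs = np.array([r * a3[p] - a3[k],
                   r * a2[p] - a2[k],
                   r * a1[p] - a1[k]]).imag       # all terms are pure imaginary
candidates = np.roots(coeffs)                     # two candidate time instants
```

The normalization removes the unknown determinant of the affine map, which is why no calibration of view B is needed.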

Arbitrary Motion: In Section 4.3 we have described how we can use the $\kappa$ measure 
to recognize the projections of the closed planar trajectory of a point undergoing 
arbitrary motion. We can modify the definition of $\kappa$ to define a new measure $\kappa'$ 

\[ \kappa'[k] = \mathbf{P}[k]^{*T} \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \mathbf{P}[p], \quad 0 < k < N \tag{18} \]

where $p$ is a constant (typically 1 or 2). The ratio of the $\kappa'$ values computed in the 
two views will be a complex 
sinusoid. The inverse Fourier transform of this quotient series would show a peak 
at $\lambda$. Thus, by looking for a peak in the inverse Fourier transform spectrum of 
the quotient series, we can determine time alignment information. 
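
The peak-finding step can be sketched as below. The form of the modified measure used here, $\mathbf{P}[k]^{*T} \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \mathbf{P}[p]$ with a fixed frequency $p$, is an assumption, as are the random contour samples, the affine map, NumPy's FFT sign convention, and the modelling of the time offset as a cyclic shift of the samples.

```python
import numpy as np

def kappa_prime(seq, p=1):
    """Assumed form: conj(P[k])^T [[0, 1], [-1, 0]] P[p] for k = 1..N-1."""
    F = np.fft.fft(seq, axis=0)
    return np.conj(F[1:, 0]) * F[p, 1] - np.conj(F[1:, 1]) * F[p, 0]

rng = np.random.default_rng(5)
N = 64
contour = rng.normal(size=(N, 2))       # trajectory samples in view A (hypothetical)

# View B: affine map of the same trajectory, starting true_shift samples later.
A = np.array([[1.1, 0.4], [-0.3, 0.9]])
b = np.array([4.0, -2.0])
true_shift = 11
view_b = np.roll(contour, -true_shift, axis=0) @ A.T + b

# Quotient series over k = 1..N-1; the k = 0 entry is left at zero.
quotient = np.zeros(N, dtype=complex)
quotient[1:] = kappa_prime(view_b) / kappa_prime(contour)

# The quotient is a complex sinusoid in k, so its inverse DFT peaks at the shift.
estimated_shift = int(np.argmax(np.abs(np.fft.ifft(quotient))))
```

With noise-free data the peak is sharp; with measured trajectories, restricting the quotient to the low frequencies that carry most of the signal energy would likely be more robust.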

Note that we have considered $\mathbf{p}[i]$ and $\mathbf{v}[i]$ to be independent. In an 
application, one would expect them to be correlated and consequently the signals 
representing the sequences of positions and velocities would be smooth. As a 
result, the higher frequencies in their Fourier representation would be negligible 
and hence we can work with fewer frequencies in these cases. 



5 Discussion 

In this paper, we presented several constraints on the projections of coplanar 
points in motion. Linear motion with uniform velocity or acceleration and 
arbitrary planar motion were considered. Table 1 summarizes the constraints 
available on moving points. Our constraints have fewer computational requirements 
than published results. We showed how these constraints translate into 
recognition constraints. We also presented methods to compute the time-alignment 
between views from image structure only. These can form the basis of 
recognition applications like human identification using motion characteristics, tracking 
moving points for ballistic applications, detecting inconsistent video sequences 
of a dynamic scene based on geometric inconsistency, etc. 

Table 1. Motion: Summary of the multiview constraints on moving points in general 
position (unless stated otherwise). (CC = Coplanar Configuration) 

Type        Camera      Conditions                       Time Invariant  Source 
Uniform V   Affine      5 pts, 8 frames                  No              Levin et al. 
Uniform V   Affine      4 pts, 2 views                   Yes             Authors 
Uniform V   Projective  6 pts, 49 frames, 1 view         No              Levin et al. 
Uniform V   Projective  6 pts, 35 frames, 1 view, (CC)   No              Levin et al. 
Uniform V   Projective  5 pts, 3 frames, 1 view, (CC)    No              Authors 
Uniform A   Affine      4 pts, 9 frames                  No              Authors 
Uniform A   Affine      4 pts, 2 views                   Yes             Authors 
Uniform A   Projective  5 pts, 5 frames, 1 view, (CC)    No              Authors 
Uniform ui  Projective  6 pts                            Yes             Levin et al. 
Uniform V   Affine      Many pts (CC)                    No              Authors 
Arbitrary   Affine      1 pt, planar closed trajectory   No              Authors 
Abstract. A Brownian motion model in the group of diffeomorphisms 
has been introduced as creating a least committed prior on warps. This 
prior is source-destination symmetric, fulfills a natural semi-group property 
for warps, and with probability 1 creates invertible warps. In this 
paper, we formulate a Partial Differential Equation for obtaining the 
maximum likelihood warp given matching constraints derived from the 
images. We solve for the free boundary conditions, and the bias toward 
smaller areas in the finite domain setting. Furthermore, we demonstrate 
the technique on 2D images, and show that the obtained warps are also 
in practice source-destination symmetric. 



1 Introduction 

In any non-rigid registration algorithm, one must weigh the data confidence 
against the complexity of the warp field mapping the source image geometrically 
into the destination image. This is typically done through spring terms in elastic 
registration [3,8,7], through the viscosity term in fluid registration [5] or by 
controlling the number of spline parameters in spline-based non-rigid registration 
[1,20]. 

If non-rigid registration algorithms, symmetric in source and destination, can 
be constructed, many problems in shape averaging and shape distribution esti- 
mation can be avoided. The regularizer is not symmetric with respect to source 
and destination in the methods mentioned above. While symmetric regularizers 
can be constructed in most cases simply by adding a term for the inverse reg- 
istration [6], this solution is not theoretically satisfactory. In [17] we show that 
Brownian warps, described in detail below, are source-destination symmetric. 
They are constructed as a Brownian motion in the group of diffeomorphisms 
(Sections 2 and 3). 

This distribution on warps leads through a Maximum a Posteriori inference 
scheme to a functional minimization formulation. We derive a Partial Differential 
Equation as the gradient descent in this functional, and use a straightforward 
time explicit and spatial forward-backward scheme for discrete implementation 
(Section 4). 

Finally, we give results on how source-destination symmetric the discrete 
implementation is in practice, and give comparisons to thin-plate spline warps 
in a randomized and bootstrapped experiment. 

2 Definitions and Motivation 

A non-rigid registration may be modeled by a warp field $W : \mathbb{R}^D \mapsto \mathbb{R}^D$ 
mapping points in one $D$-dimensional image into another $D$-dimensional image. We 
give the definition: 

Definition 1 (Warp Field). A warp field $W(x) : \mathbb{R}^D \mapsto \mathbb{R}^D$ maps all points 
in the source image $I_s(x) : \mathbb{R}^D \mapsto \mathbb{R}$ into points of the destination image 
$I_d(x) : \mathbb{R}^D \mapsto \mathbb{R}$ such that $I_s(W(x))$ is the registered source image. $W$ is 
invertible and differentiable (i.e., a diffeomorphism) and has everywhere a positive 
Jacobian, $\det(\partial_{x_j} W^i) > 0$. 

Here, we have made the assumption that warps are invertible and differentiable. 
This means that we do not wish to warp in cases where the structure changes 
topology. For modeling shape changes and shape variability, this is 
the optimal setting. However, in computer vision problems like stereo and flow 
computation from projected images, occlusion boundaries are not modeled by our 
approach. In these cases, the ecological statistics of the local warps must be modeled 
taking the projection into account. 

A diffeomorphism will always have the same sign of the Jacobian everywhere. 
Our choice of positive Jacobian applies to those cases where the object is not 
geometrically reflected. 

The identification of a warp field on the basis of images is a matter of in- 
ference. Below we will apply the Bayes inference machine [13], but a similar 
formulation should appear when using information theoretic approaches such as 
the minimum description length principle [18]. 

We wish to determine the warp field $W$ that maximizes the posterior 

$$p(W \mid I_s, I_d) = \frac{1}{Z}\, p(I_s, I_d \mid W)\, p(W)$$

where $Z$ is a normalizing constant (sometimes denoted the partition function), 
$p(I_s, I_d \mid W)$ is the likelihood term, and $p(W)$ is the warp prior. The likelihood 
term is based on the similarity of the warped source and destination image and 
may, in this formulation, be based on landmark matches [4], feature matches [15, 
19], object matches [2], image correlation [15], or mutual information [21]. The 
major topic of this paper is the prior $p(W)$ that expresses our belief in the 
regularity of the warp field prior to identifying the images. 



3 Brownian Warps 

We seek that distribution of warps which is the analogue of Brownian motion. 
We wish this distribution to be independent of warps performed earlier (i.e., 
invariant with respect to warps). This property is of fundamental importance 
particularly when determining the statistics of empirical warps, creating mean 
warps etc. In such cases, it is required by consistency in order to avoid the use of 
a fiducial pre-defined standard warp. We may formulate this in the probabilistic 
setting as: 

$$p(W = W_2 \circ W_1) = \int p(W_2)\, dW_1 .$$

This corresponds to the semi-group property of Brownian motion: The distri- 
bution of positions after two moves corresponds to two independent moves and, 
through the central limit theorem, leads to a Gaussian distribution of positions. 
Since this also holds for a concatenation of many warps, we can construct a warp 
as 

$$W_B = \lim_{N \to \infty} \prod_{i=0}^{N}{}^{\circ}\, W_i ,$$

where the Wi are independent infinitesimal warps. This corresponds exactly to 
the definition of a Brownian motion on the real axis if the concatenation product 
is replaced by an ordinary sum. 

In order to find this limiting distribution when all Wi are independent, we 
investigate motion in the neighborhood of a single point following along all the 
warps and make the following lemma: 

Lemma 1 (Local structure). Let $J_W = \partial_{x_j} W^i$ be the local Jacobian of $W$. 
Then, the Jacobian of a Brownian warp is 

$$J_{W_B} = \lim_{N \to \infty} \prod_{i=0}^{N} J_{W_i} .$$

Proof. This is obviously true due to the chain rule of differentiation. □ 

Assume that an infinitesimal warp acts as the infinitesimal independent motion 
of points. In this case, all entries in the local Jacobian are independent and 
identically distributed around the identity. Hence, we may now model 

$$J_{W_B} = \lim_{N \to \infty} \prod_{i=0}^{N} \left( I + \frac{\sigma}{\sqrt{N}} H_i \right) , \qquad (1)$$

where $H_i$ is a $D \times D$ matrix of independent identically distributed entries of unit 
spread. The denominator $\sqrt{N}$ is introduced to make the concatenation product 
finite, and $\sigma$ is the spread or the "size" of the infinitesimal warps. 
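
The product of Eq. 1 can be explored numerically for finite $N$. The following is a minimal sketch for $D = 2$, not the construction used in the paper: it assumes Gaussian entries for $H_i$ (the text only requires independent entries of unit spread), and the helper names `matmul2`, `det2` and `brownian_jacobian` are hypothetical.

```python
import math
import random


def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested tuples."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )


def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def brownian_jacobian(sigma, n_steps, rng):
    """Finite-N approximation of the product in Eq. 1 for D = 2.

    Assumption: the entries of H_i are standard Gaussian.
    """
    scale = sigma / math.sqrt(n_steps)
    jac = ((1.0, 0.0), (0.0, 1.0))
    for _ in range(n_steps):
        # One infinitesimal warp: identity plus a small i.i.d. perturbation.
        step = (
            (1.0 + scale * rng.gauss(0.0, 1.0), scale * rng.gauss(0.0, 1.0)),
            (scale * rng.gauss(0.0, 1.0), 1.0 + scale * rng.gauss(0.0, 1.0)),
        )
        # Later warps compose on the left, following the chain rule.
        jac = matmul2(step, jac)
    return jac


rng = random.Random(0)
samples = [brownian_jacobian(0.5, 500, rng) for _ in range(100)]
# Every sampled Jacobian should keep a positive determinant (no reflection).
all_positive = all(det2(j) > 0 for j in samples)
```

Each factor stays very close to the identity for large $N$, so the determinant of the product remains positive in practice, which is the finite-$N$ counterpart of the invertibility property discussed in the text.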

To summarize, the limiting distribution of Eq. 1 is the distribution of the 
Jacobian of a Brownian Warp. In turn, this defines the Brownian distribution 
on warps, as we have no reason to assume other structure in the distribution. 

Unfortunately, the solution to Eq. 1 is not given in the literature on random 
matrices. Gill and Johansen [10] solve the problem for matrices with positive 
entries and Hognas and Mukherjea [11] solve, among other cases, the situation 
when the matrices are symmetric. Recently, we have solved the case for two 
dimensions [12] and are presently considering the solution for three. Here, we 
present only the result. 

Theorem 1 (2D Brownian Jacobian). The limiting distribution of Eq. 1, 
where $H_i$ have independent entries of unit spread and $W : \mathbb{R}^2 \mapsto \mathbb{R}^2$, is given 
as 

$$p(J_{W_B}) = G(S/\sigma) \sum_{n=0}^{\infty} g_n(F/\sigma) \cos(n\theta) , \qquad (2)$$

where $G$ is the unit spread Gaussian, $g_n$ are related to the Jacobi functions, and 
the parameters are given as follows: 

Scaling $S = \log(\det(J_{W_B}))$ 

Skewness $F = \frac{\|J_{W_B}\|_2^2}{2 \det(J_{W_B})}$ 

Rotation $\theta = \arctan\left(\frac{j_{12} - j_{21}}{j_{11} + j_{22}}\right)$ 

It is shown in [12] that the limiting distribution does not depend on features of 
the infinitesimal distribution other than its spread, $\sigma$. This limiting distribution 
is thus least committed in the sense that it arises from the sole assumption of 
invariance under warps. The parameter $\sigma$ may be viewed as a measure of rigidity. 
The effects of the parameters are shown in Fig. 1. 
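
To make the three parameters concrete, the sketch below computes them for a 2x2 Jacobian. The conventions used here are assumptions chosen so that the identity gives $S = 0$, $F = 1$, $\theta = 0$, matching the values quoted for Fig. 1: $F = \|J\|_2^2 / (2 \det J)$ with the Frobenius norm, and $\theta$ from the antisymmetric part of $J$ (its sign convention is arbitrary here). The function name `jacobian_parameters` is hypothetical.

```python
import math


def jacobian_parameters(j):
    """Return (S, F, theta) for a 2x2 Jacobian with positive determinant.

    Assumed conventions (not confirmed by the text):
      S     = log(det J)
      F     = ||J||_F^2 / (2 det J)
      theta = atan2(j12 - j21, j11 + j22)
    """
    (j11, j12), (j21, j22) = j
    det = j11 * j22 - j12 * j21
    if det <= 0.0:
        raise ValueError("Jacobian must have a positive determinant")
    scaling = math.log(det)
    skewness = (j11**2 + j12**2 + j21**2 + j22**2) / (2.0 * det)
    # atan2 is used instead of atan so that j11 + j22 = 0 does not divide by zero.
    rotation = math.atan2(j12 - j21, j11 + j22)
    return scaling, skewness, rotation


identity = ((1.0, 0.0), (0.0, 1.0))
stretch = ((2.0, 0.0), (0.0, 0.5))  # area preserving, anisotropic
angle = 0.5
rotate = ((math.cos(angle), -math.sin(angle)),
          (math.sin(angle), math.cos(angle)))

params_identity = jacobian_parameters(identity)
params_stretch = jacobian_parameters(stretch)
params_rotate = jacobian_parameters(rotate)
```

Under these conventions a pure stretch changes only $F$ and a pure rotation changes only $\theta$, which is the independent action illustrated in Fig. 1.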



[Figure: three panels showing a deformed unit square. Scaling: S = 0.8, F = 1, θ = 0. Skew: S = 0, F ≈ 2, θ = 0. Rotation: S = 0, F = 1, θ = 0.5.] 

Fig. 1. The independent action of the parameters on a unit square. 



It has been proved that this distribution creates invertible warps (with 
probability 1), is invariant under inversion of the warp, and is Euclidean invariant [17]. 
Here we prove that the distribution is invariant under simultaneous and identical 
warping of source and destination. 

Theorem 2 (Local diffeomorphic invariance). The distribution of warps 
given as spatially independent Jacobians each distributed according to Eq. 2 is 
invariant with respect to a diffeomorphism simultaneously acting on source and 
destination. 

Proof. A source and destination are related by a local Jacobian $J$ such that 
$n_2 = J n_1$, where $n_1, n_2$ are local frames in the source and destination image 
respectively. An arbitrary diffeomorphism acts locally on the frames with its 
Jacobian $h$. Acting on source and target simultaneously makes $n_2 h = J' n_1 h$. As 
all of $h, n_1, n_2$ are invertible, obviously $J = J'$. □ 

This theorem only holds as a local property, but is in general valid for a whole 
warp if an invariant measure is used for integration over the full warp field. 
Construction of such a measure is, however, not trivial in the general case. We 
will do so for the pairwise image matching problem below. 

For computational purposes it may be convenient to approximate the above 
distribution by a distribution which is also independent in $F$ and $\theta$. This can be 
done in many ways without losing the symmetry and diffeomorphic invariance. 
However, the semi-group property of concatenation of warps will no longer hold 
exactly. We suggest the following approximation: 

$$p(J) \approx G_\sigma(S)\, G_{\sigma/\sqrt{2}}(\theta)\, e^{-F/\sigma} , \qquad (3)$$

where $G_\sigma$ is a Gaussian of spread $\sigma$. This approximation has a relative error of 
less than 3% for all reasonable values of $S$, $\theta$, $F$ when $\sigma > 0.4$. 
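
The link between a factorized approximation of this kind and a quadratic energy can be checked with a short worked step. The derivation below is a sketch under stated assumptions: the approximation is taken to be the product $G_\sigma(S)\, G_{\sigma/\sqrt{2}}(\theta)\, e^{-F/\sigma}$ with Gaussians $G_s(x) \propto \exp(-x^2 / 2s^2)$, and all normalization constants are collected in $c$.

```latex
\begin{align*}
-\log p(J) &\approx \frac{S^2}{2\sigma^2}
             + \frac{\theta^2}{2(\sigma/\sqrt{2})^2}
             + \frac{F}{\sigma} + c \\
           &= \frac{1}{2\sigma^2}\left( S^2 + 2\theta^2 + 2\sigma F \right) + c .
\end{align*}
```

Up to the positive factor $1/(2\sigma^2)$, which does not change the minimizer, this is an integrand of the form $S^2 + 2\theta^2 + 2\sigma F$.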

Taken from local points to a global distribution of a full warp, we may assume 
spatial independence of the local Jacobian of the warp. This does not correspond 
to assuming local independent motion of points, but that the local spatial 
differences in motion are distributed independently, just like independent increments 
(gradient) of neighboring points of a function in turn lead to Tikhonov 
regularization for functions. Taking this Markov Random Field approach, we may 
say that we formulate a first order MRF on the point motion function. The 
above distribution may then be viewed as a Gibbs distribution, and the energy 
or minus-log-likelihood of a full field then reads 

$$E'_s(W) = -\log p(W) + c = \int_\Omega S^2 + 2\theta^2 + 2\sigma F \, dx .$$

However, the integration variable is not invariant under the warp, and the 
functional will not lead to warp invariance. This may be obtained by using a warp 
invariant integration measure $d\tilde{x}$: 

$$E_s(W) = -\log p(W) + c = \int_\Omega S^2 + 2\theta^2 + 2\sigma F \, d\tilde{x} ,$$

where $c$ is an arbitrary irrelevant constant and $\tilde{x} = x\sqrt{\det(J)}$ are integration 
variables invariant under the warp, chosen to ensure global as well as local warp 
invariance. It may at first glance seem ad hoc to introduce this invariant measure. 
However, it also follows directly from the probabilistic theory if one takes into 
account that after some (of the infinitely many) warps, it is more probable to 
see the areas that have increased in size. This is handled elegantly in the theory 
by Markussen in the Ito integral of the spatio-temporal warp [16], leading to the 
same result. 

4 Implementation 



In general the warp energy Eq. 3 is augmented by an image or landmark matching 
term, so that the full functional to minimize for a given warp inference task reads 

$$E(W) = E_s(W) + \lambda E_I(W) ,$$

where $E_I$ is an image matching functional such as cross-correlation, mutual 
information, or landmark distance. Unfortunately the energy functional Eq. 3 is 
highly non-linear in the coordinate functions, and simple tricks such as 
eigenfunction expansions and derived linear splines are not possible. Therefore we will 
optimize this functional using a PDE as gradient descent. We concentrate only 
on $E_s$, as $E_I$ is thoroughly treated elsewhere [9]. 

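We treat the energy minimization problem using a gradient descent scheme: 

$$\partial_t W = -\frac{\delta E}{\delta W} = -\frac{\delta E_s}{\delta W} - \lambda \frac{\delta E_I}{\delta W} .$$

Here we first concentrate on $E'_s$ (not using the invariant integration variable $\tilde{x}$ 
but plainly $dx$): 

$$\frac{\delta E'_s}{\delta W} = \frac{2\log D - 2\sigma F}{D} \frac{\delta D}{\delta W} + \frac{\sigma}{D} \frac{\delta \|J\|_2^2}{\delta W} + 4\theta \frac{\delta \theta}{\delta W} ,$$

where $J$ is the Jacobian matrix of $W$ and $D = \det(J)$. Using the invariant 
coordinates (substituting $dx \mapsto \sqrt{D}\,dx$) this yields 

$$\frac{\delta E_s}{\delta W} = \frac{H_s/2 + 2\log D - 2\sigma F}{\sqrt{D}} \frac{\delta D}{\delta W} + \frac{\sigma}{\sqrt{D}} \frac{\delta \|J\|_2^2}{\delta W} + 4\theta\sqrt{D}\, \frac{\delta \theta}{\delta W} ,$$

where $H_s$ is the local energy so that $E_s = \int H_s \, d\tilde{x}$. The variations left in these 
equations are very simple as all terms are co-linear or quadratic in the coordinate 
functions of $W$. 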

Using $E'_s$ directly has the problem that the solution is no longer source-target 
symmetric, as the emphasis in the energy varies from point to point with 
the local scaling. Using $E_s$ in its full form, with the invariant integration 
variable, solves this problem. 

On a bounded domain, this will lead to a simultaneous minimization of the 
size of the domain in order to minimize the functional, and hence to a bias toward 
shrinking warps. It will no longer give meaningful warps directly. This may be solved by 
fixing the size of the invariant domain directly using a Lagrange multiplier in 
the optimization problem. Now 

$$E_{s\text{-bounded}} = \int_\Omega H_s \, d\tilde{x} + \lambda \int_\Omega d\tilde{x} ,$$

and we directly solve for $\lambda$ using the fact that the time evolution of $\int_\Omega d\tilde{x}$ vanishes if 

[expression for $\lambda$ lost in extraction] 

By simple calculus of variation we obtain: 

$$\frac{\delta E_{s\text{-bounded}}}{\delta W} = \frac{\lambda/2 + H_s/2 + 2\log D - 2\sigma F}{\sqrt{D}} \frac{\delta D}{\delta W} + \frac{\sigma}{\sqrt{D}} \frac{\delta \|J\|_2^2}{\delta W} + 4\theta\sqrt{D}\, \frac{\delta \theta}{\delta W} ,$$

where $\lambda$ must be updated along the evolution. As $\lambda$ is an integral measure, this 
is actually not a PDE but a partial integro-differential equation. So far, we 
have no proofs of stability or uniqueness of the solution. However, it works in 
practice. It does not fall within the class for which uniqueness 
has been proved [9]. It also works on a totally different function space, since in 
previous work [9] the warps have been living in component-wise Sobolev spaces, 
which have a non-empty intersection with the space of diffeomorphisms. However, 
some diffeomorphisms are not in the Sobolev space, and some members of the 
component-wise Sobolev space do fold and are obviously not diffeomorphisms. 

This algorithm guarantees that the resulting warp is a diffeomorphism. It 
corresponds to some degree to the large deformation diffeomorphisms by Joshi 
and Miller [14] in the sense that their formulation also seeks a solution composed 
over many time steps. However, we have succeeded in integrating out the time, 
and found the closed form solution for the resulting functional. Hence, we find the 
solution directly by optimizing the warp, and not by optimizing the warp together 
with all the intermediate steps from source to destination. An interesting theoretical 
link between the two approaches is found in Markussen [16], where a warp-time 
discretization is performed, but where a Brownian motion formulation is used. 

Now turning toward discretization of the above algorithm: For time 
discretization we use a simple explicit scheme, alternating between a gradient descent 
step along the above variation and an update of $\lambda$. For spatial discretization we 
approximate the Jacobian at every grid point with both forward and backward schemes 
in both coordinates, giving a total of 12 discrete Jacobians at 
every grid point (see Fig. 2). Let us denote these $J_t$, $t \in [1; 12]$. The discretization 
of the variation is then performed as 

$$\frac{\delta E_d}{\delta W} = \sum_{t=1}^{12} \left[ \frac{\lambda/2 + H_t/2 + 2\log D_t - 2\sigma F_t}{\sqrt{D_t}} \frac{\delta D_t}{\delta W} + \frac{\sigma}{\sqrt{D_t}} \frac{\delta \|J_t\|_2^2}{\delta W} + 4\theta_t\sqrt{D_t}\, \frac{\delta \theta_t}{\delta W} \right] .$$

The summation cannot be performed by first summing over the variational parts, 
as for example 

$$\sum_{t=1}^{12} \frac{\delta D_t}{\delta W} = 0 .$$

Fig. 2. In every point $(x_i, y_i)$, the local Jacobian is estimated from the 12 local discrete 
frames including the point. To the left, the four frames where the point contributes 
centrally are illustrated, whereas the 8 frames where the point contributes in 
extremal position are illustrated to the right. 

At the boundary, the contributions from the discrete Jacobians leaving the 
domain are neglected, as the free boundary conditions are implemented in this 
way. 
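
As a rough illustration of evaluating the warp energy on a grid, the sketch below sums the local energy over discrete Jacobians. It is a simplification and not the scheme described above: it uses only the forward-difference frame at each grid point rather than all 12 frames, it assumes the local energy $S^2 + 2\theta^2 + 2\sigma F$ with $F = \|J\|_2^2 / (2 \det J)$, and it omits the Lagrange multiplier. The names `cell_energy` and `forward_energy` are hypothetical.

```python
import math


def cell_energy(j, sigma):
    """Local energy S^2 + 2*theta^2 + 2*sigma*F of one 2x2 Jacobian."""
    (j11, j12), (j21, j22) = j
    det = j11 * j22 - j12 * j21
    if det <= 0.0:
        return math.inf  # a fold: the warp is not a diffeomorphism here
    s = math.log(det)
    f = (j11**2 + j12**2 + j21**2 + j22**2) / (2.0 * det)
    theta = math.atan2(j12 - j21, j11 + j22)
    return s * s + 2.0 * theta * theta + 2.0 * sigma * f


def forward_energy(wx, wy, sigma):
    """Sum the local energy over forward-difference Jacobians of a warp.

    wx[i][j], wy[i][j] hold the warped coordinates of grid point (i, j),
    with i indexing x and j indexing y on a unit-spaced grid.
    Each term is weighted by sqrt(det J), the invariant measure.
    """
    total = 0.0
    for i in range(len(wx) - 1):
        for j in range(len(wx[0]) - 1):
            jac = (
                (wx[i + 1][j] - wx[i][j], wx[i][j + 1] - wx[i][j]),
                (wy[i + 1][j] - wy[i][j], wy[i][j + 1] - wy[i][j]),
            )
            h = cell_energy(jac, sigma)
            if math.isinf(h):
                return math.inf
            det = jac[0][0] * jac[1][1] - jac[0][1] * jac[1][0]
            total += h * math.sqrt(det)
    return total


# Identity warp on a 5 x 5 grid: every cell has S = 0, theta = 0, F = 1.
n = 5
grid_x = [[float(i) for _ in range(n)] for i in range(n)]
grid_y = [[float(j) for j in range(n)] for _ in range(n)]
identity_energy = forward_energy(grid_x, grid_y, 1.0)

# Mirroring the x coordinate makes every determinant negative (a fold).
mirrored_x = [[-x for x in row] for row in grid_x]
folded_energy = forward_energy(mirrored_x, grid_y, 1.0)
```

Returning an infinite energy for a non-positive determinant mirrors the fact that the prior assigns zero probability to folding warps.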



5 Results 

We see from the energy formulation that the rigidity parameter determines the 
relative weight of the skewness term to the scaling and rotation terms. For 
illustration of the independent terms, see Fig. 3. For large deformations, the 
difference from spline-based methods becomes obvious, as for example thin plate splines 
can introduce folds in the warping (see Fig. 4). 

For testing the source-target symmetry we conducted the following 
experiment. We kept the boundary fixed and moved two random points in the interior 
with a Brownian motion to new random positions (see Fig. 5). 
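
The error bounds reported for this experiment are bootstrapped confidence intervals. The text does not state which bootstrap variant was used, so the following is only a sketch of one common choice, the percentile bootstrap for the mean, applied to made-up error values. The function name `bootstrap_interval` and the sample data are hypothetical.

```python
import random


def bootstrap_interval(values, level=0.90, n_resamples=2000, rng=None):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = rng or random.Random()
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement and record the mean of the resample.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical round-trip errors (in pixels) from 25 runs.
rng = random.Random(0)
errors = [abs(rng.gauss(0.3, 0.1)) for _ in range(25)]
low, high = bootstrap_interval(errors, rng=rng)
sample_mean = sum(errors) / len(errors)
```

With 25 runs per setting, as in Fig. 5, the interval is fairly wide; more runs narrow it, which is one reason to repeat the experiment with a larger number of runs.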

The figures clearly show that the warp generated by the above algorithm is 
statistically significantly more symmetric than thin-plate spline warps. The motion 
of points after warping forward and back is less than a third of that in the case 
of thin plate splines. Hence, not only is the theory symmetric, implementations 
show significant improvements. However, the warps are not totally symmetric, 
which in our opinion is due to the spatial discretization, as the error is smaller 
on a 50 x 50 grid than on a 10 x 10 grid. 

Fig. 3. Illustration of deformation of a regular grid. Two points in the center have 
been moved up and down respectively, while the corners are kept fixed. We see that 
the scaling term (top left) aims at keeping the area constant. The skewness term 
(bottom left) aims at keeping the stretch equally large in all directions. Top right is a 
combination of scaling and skewness (σ = 1). Bottom right is a thin plate spline for 
comparison. 

Fig. 4. Leftmost are two images of large deformations: Left is the maximum likelihood 
Brownian warp, right is a thin plate spline. Rightmost two images are two consecutive 
warps where landmark motions are inverse: Left is Brownian warps, right is thin plate 
spline. Brownian warps do not give the exact inverse due to numerical imprecision, but 
come closer than the thin plate spline. 

Fig. 5. Top-left the warp carried out at random is illustrated. Top-right is the fraction 
of thin-plate warps that contain a fold (are not invertible) as a function of the spread 
of the random motion of the two interior points. Below is the absolute error in pixel 
position when warping forward and concatenating with the backward warp. To the right is 
the same for the relative error of the Brownian warps and the thin-plate warps. 25 
runs for each standard deviation on a 50 x 50 grid were performed. All error bounds are 
bootstrapped 90% confidence intervals. 

Fig. 6. As previous figure, but with 200 runs for each standard deviation on a 10 x 10 
grid. 

Fig. 7. Top-left a mammogram taken in 1999. Top-right a mammogram of the 
same breast taken in 2001 after hormonal treatment. Bottom-left 1999 breast warped 
to 2001. Bottom-right difference image. The correlation of intensities is 0.72 without 
warping, 0.86 with thin plate warping, and 0.92 with Brownian warping. 



6 Conclusion 

We have exploited a prior for warps based on a simple invariance principle un- 
der warping. This distribution is the warp analogue of Brownian motion for 
additive actions. An estimation based on this prior guarantees an invertible, 
source-destination symmetric, and warp-invariant warp. When computational 
time is of concern, approximations can be made which violate the basic semi- 
group property while maintaining the invariances. For fast implementations, we 
recommend an approximation including only the skewness term, as this has nice 
regularizing properties. 

We have developed a PDE scheme for implementing an algorithm computing 
the maximum-likelihood warp. We have tested this in the case of exact 
landmark matching, and shown that it does not fold (as theory predicts), as linear 
approaches will do, and that also in discrete approximation the scheme 
yields solutions very close to being source-target symmetric. 

Future work includes a joint optimization scheme with other image matching 
terms as used earlier [9]. As a final illustration we show an example of warping 
a mammogram in Figure 7, where the Brownian warping increases the intensity 
correlation compared to thin-plate warping. 

Abstract. This paper proposes the use of a particle filter combined 
with color, depth information, gradient and shape features as an efficient 
and effective way of dealing with tracking of a head on the basis of 
an image stream coming from a mobile stereovision camera. The head is 
modeled in the 2D image domain by an ellipse. A weighting function 
is used to include spatial information in the color histogram representing 
the interior of the ellipse. The length of the ellipse's minor axis is 
determined on the basis of depth information. The dissimilarity between 
the current model of the tracked object and target candidates is indicated 
by a metric based on the Bhattacharyya coefficient. Variations of the color 
representation as a consequence of the ellipse's size change are handled by 
taking advantage of the scale invariance of the similarity measure. The 
color histogram and parameters of the ellipse are dynamically updated 
over time to discriminate in the next iteration between the candidate 
and actual head representation. This makes it possible to track not only the 
face profile which was shot during initialization of the tracker, but also 
different profiles of the face as well as the head. 
Experimental results which were obtained on long image sequences in a 
typical office environment show the feasibility of our approach to perform 
tracking of a head undergoing complex changes of shape and appearance 
against a varying background. The resulting system runs in real-time on 
a standard laptop computer installed on a real mobile agent. 



1 Introduction 

Visual tracking of objects in video sequences is becoming an important task in 
a wide range of applications utilizing computer vision interfaces, including hu- 
man action recognition, teleconferencing, robot teleoperation as well as human- 
computer interaction. Many different trackers for various tasks have been de- 
veloped in recent years and particular interests and research activities have 
increased significantly in vision-based methods. One of the purposes of visual 
tracking is to estimate the states of objects of interest from an image sequence. 
However, cluttered backgrounds, unknown and changing lighting conditions and 
multiple moving objects make the vision-based tracking tasks challenging. Some 
vision-based systems allow a determination of a body position and real-time 
tracking of head and hands. Pfinder [1] uses a multi-class statistical model of 
color and shape to obtain a blob representation of the tracked silhouette in a 
wide spectrum of viewing conditions. In the techniques known as CamShift [2] 
and MeanShift [3] the current frame is searched for a region in a variable-size 
window, whose color content best matches a reference model. The searching 
process proceeds iteratively starting from the final location in the previous frame. 
The new object location is calculated based on the mean shift vector as an 
estimation of the gradient of the Bhattacharyya function. This method requires that 
the new target center lies within the kernel centered on the previous location of 
the target. The original application of the particle filter in computer vision was 
for object tracking in an image sequence [4]. Particle filtering is now a popular 
solution to problems relying on visual tracking. In the work of [5] a fixed ellipse 
is used to approximate the head outline during 2D tracking on the basis of the 
particle filter. A system developed recently by Chen et al. [6] uses a causal 1D 
contour model in dynamic programming to find the best contour with respect to 
a given initial one. A five dimensional ellipse is used to represent the head 
contour in a multiple hypothesis framework. Nummiaro et al. [7] used an ellipse with 
fixed orientation to model a head and to extract the color distribution of the 
ellipse's interior. The likelihood is calculated on the basis of a weighted histogram 
representing both color and shape of the head. Global color reference models and 
the Bhattacharyya coefficient as a similarity measure between the color distribution 
of the model and target candidates have been used in a Monte Carlo tracker 
[8]. A histogram representation of the region of interest has been extracted in 
a rectangular window. Recently, laser range finders have been used to track 
people in populated environments for interactive robot applications [9]. 

In this paper, we focus our attention on tracking human head/face, one of 
the most important features in tasks consisting in people tracking and action 
recognition. The main objective of the research is to detect and track the head 
to perform person following with a real mobile agent which is equipped with an 
on-board camera. The initial position of the head to be tracked is determined 
by means of face detection. We consider scenarios where a stereo camera is 
mounted on a mobile agent and our aim is tracking the head which can undergo 
complex changes of shape and appearance. The appearance of the object of 
interest changes continuously due to non-rigid human motion and a change in 
viewpoints. There are many other difficulties in extracting features distinguishing 
the target, and a further challenge lies in the fact that the background may not be static. 
We consider the problem of head tracking by taking advantage of gradient, color 
together with shape as well as depth information which are combined with the 
particle filter. One of the problems of tracking on the basis of color is that 
lighting conditions may have influence on perceived color of the target. Even 
in the case of constant lighting conditions, the seeming color of the target may 
change over a frame sequence, since the target can be shadowed by other objects. 
The color distributions representing the target in image sequences are therefore 
not stationary. 

The main goal of the tracker is to find the most probable sample 
distribution. The particles representing the candidate ellipses are verified with respect to 
the intensity gradient near the edge of the ellipse and the matching score of the color 
histograms representing the interior of the ellipse surrounding the tracked object 
and of the currently analyzed one. During the sample weighting stage, in which candidate 
ellipses are considered one after another, the size of the ellipse projected into the image 
depends on the depth information. The color histogram and parameters of the 
ellipse are dynamically updated over time to discriminate in the next iteration 
between the candidate and actual head representation. 

The contribution of our work lies in the use of particle filters combined with 
the cues mentioned above to robustly solve the difficult and useful problem of head 
tracking in color images. The tracker has been evaluated in experiments 
consisting of face tracking with a stereovision camera mounted on a real mobile agent. 
A version of the tracker which utilizes gradient, color as well as shape 
information combined with particle filters has been evaluated using the PETS-ICVS 
2003 video data set, which is provided for experiments relating to smart 
meeting rooms. 

The rest of the paper is organized as follows. In the next section we briefly 
describe particle filtering. The usage of color cue, gradient, shape information 
and stereovision in a particle filter is explained in section 3. In sections 4 and 5 
we report results which were obtained in experiments. Finally, some conclusions 
follow in the last section. 



2 Particle Filtering 



In this section we formulate the visual tracking problem in a probabilistic 
framework. Among the tracking methods, the ones based on particle filters have 
attracted much attention recently and have proved to be robust solutions that reduce 
the computational cost by searching only those regions of the image where the 
object is predicted to be. The key idea underlying all particle filters is to 
approximate the probability distribution by a weighted sample collection. 

The state of the tracked object at time $t$ is denoted $x_t$ and its history is 
$X_t = \{x_1, \ldots, x_t\}$. Similarly the set of image features at time $t$ is $z_t$ with 
history $Z_t = \{z_1, \ldots, z_t\}$. The evolution of the state forms a temporal Markov 
chain so that the new state is conditioned directly on the immediately preceding 
state and independent of the earlier states, $p(x_t \mid X_{t-1}) = p(x_t \mid x_{t-1})$. 
Observations $z_t$ are assumed to be independent, both mutually and with respect 
to the dynamical process, $p(Z_{t-1}, x_t \mid X_{t-1}) = p(x_t \mid X_{t-1}) \prod_{i=1}^{t-1} p(z_i \mid x_i)$. 
The observation process is defined by the conditional density $p(z_t \mid x_t)$. Given a 
continuous-valued Markov chain with independent observations, the conditional 
state density $p(x_t \mid Z_t)$ represents all information about the state at time $t$ that 
is deducible from the entire data-stream up to that time. 

We can use Bayes' rule to determine the a posteriori density $p(x_t \mid Z_t) = 
p(x_t \mid z_t, Z_{t-1})$ from the a priori density $p(x_t \mid Z_{t-1})$ in the following manner 

$$p(x_t \mid Z_t) = \frac{p(z_t \mid x_t, Z_{t-1})\, p(x_t \mid Z_{t-1})}{p(z_t \mid Z_{t-1})} = k_t\, p(z_t \mid x_t)\, p(x_t \mid Z_{t-1}) , \qquad (1)$$

where $k_t$ is a normalization factor that is independent of $x_t$ and 

$$p(x_t \mid Z_{t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid Z_{t-1})\, dx_{t-1} . \qquad (2)$$

This equation is used to propagate the probability distribution via the 
transition density $p(x_t \mid x_{t-1})$. The density function $p(x_t \mid Z_{t-1})$ depends on the 
immediately preceding distribution $p(x_{t-1} \mid Z_{t-1})$, but not on any function prior to 
$t - 1$, so it describes a Markov process. Multiplication of the a priori density 
$p(x_t \mid Z_{t-1})$ by the observation density $p(z_t \mid x_t)$ in Eq. 1 applies the reactive 
effect expected from observations. The observation density $p(z_t \mid x_t)$ defines the 
likelihood that a state $x_t$ causes the measurement $z_t$. The complete tracking 
scheme, known as the recursive Bayesian filter, first calculates the a priori 
density $p(x_t \mid Z_{t-1})$ using the system model and then evaluates the a posteriori 
density $p(x_t \mid Z_t)$ given the new measurement: 

$$p(x_{t-1} \mid Z_{t-1}) \xrightarrow{\text{dynamics}} p(x_t \mid Z_{t-1}) \xrightarrow{\text{measurement}} p(x_t \mid Z_t) .$$
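
The two stages of the recursion can be illustrated on a finite state space, where the integral of the prediction step becomes a sum. The following is a minimal sketch and not part of the tracker: the three-cell state space, the transition matrix and the likelihood values are made up for illustration, and the function names `predict` and `update` are hypothetical.

```python
def predict(prior, transition):
    """Prediction step on a finite state space: sum over the previous state.

    prior[j]         = p(x_{t-1} = j | Z_{t-1})
    transition[j][i] = p(x_t = i | x_{t-1} = j)
    """
    n = len(prior)
    return [sum(transition[j][i] * prior[j] for j in range(n)) for i in range(n)]


def update(predicted, likelihood):
    """Measurement step: multiply by p(z_t | x_t) and renormalize (factor k_t)."""
    unnormalized = [l * p for l, p in zip(likelihood, predicted)]
    k = 1.0 / sum(unnormalized)
    return [k * u for u in unnormalized]


# Hypothetical three-cell example: the target tends to stay where it is.
belief = [1 / 3, 1 / 3, 1 / 3]
transition = [
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]
likelihood = [0.1, 0.7, 0.2]  # p(z_t | x_t = i) for the current measurement

predicted = predict(belief, transition)
posterior = update(predicted, likelihood)
```

Particle filtering replaces the exhaustive sum over states with a weighted set of samples, which is what makes the same recursion tractable for continuous, multi-dimensional states.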

The density $p(x_t \mid Z_t)$ can be very complicated in form and can have multiple 
peaks. The need to track more than one of these peaks results from the fact that 
the largest peak for any given frame may not always correspond to the right 
peak. The random search known as particle filtering has proven useful in the face of 
such considerable algorithmic difficulties and allows us to extract one or another 
expectation. One of the attractions of sampled representations of probability 
distributions is that some calculations can be easily realized. 

Taking a sample representation of $p(x_t \mid Z_t)$, we have at each step $t$ a set 
$S_t = \{(s_t^{(n)}, \pi_t^{(n)}) \mid n = 1 \ldots N\}$ of $N$ possibly distinct samples, each with an 
associated weight. The sample weight represents the likelihood of a particular sample 
being the true location of the target. It is calculated by first determining the 
ellipse's minor axis on the basis of depth information, and then computing the gradient 
along the ellipse's boundary as well as the matching score of the histograms representing the 
interior of the ellipses which bound (i) the tracked object and (ii) the currently 
considered candidate. Such a sample set composes a discrete approximation of the probability 
distribution. The prediction step of Bayesian filtering is realized by drawing with 
replacement $N$ samples from the set computed in the previous iteration, using 
the weights $\pi_{t-1}^{(n)}$ as the probability of drawing a sample, and by propagating 
their state forward in time according to the prediction model $p(x_t \mid x_{t-1})$. This 
corresponds to sampling from the transition density [10]. The new set would 
predominantly consist of samples that appeared in the previous iteration with large 
weights. In the correction step, a measurement density $p(z_t \mid x_t)$ is used to 
weight the samples obtained in the prediction step, $\pi_t^{(n)} = p(z_t \mid x_t = s_t^{(n)})$. The 
complete scheme of the sampling procedure outlined above can be summarized 
in the following pseudo-code: 



$S_t = \emptyset$
for n = 1 to N do
    select k with probability $\pi_{t-1}^{(k)} / \sum_i \pi_{t-1}^{(i)}$
    propagate $s_t^{(n)} = A s_{t-1}^{(k)} + w_t^{(n)}$
    calculate non-normalized weight $\pi_t^{(n)} = p(z_t \mid s_t^{(n)})$
    add $(s_t^{(n)}, \pi_t^{(n)})$ to $S_t$
endfor

The component A in the propagation model is deterministic and $w_t^{(n)}$ is a
multivariate Gaussian random variable. As the number of samples increases, the
precision with which the samples approximate the pdf increases. The mean state
can be estimated at each time step as
$E[S_t] = \sum_{n=1}^{N} \pi_t^{(n)} s_t^{(n)}$, where the weights $\pi_t^{(n)}$
are normalized to sum to 1.
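
As an illustration only, one iteration of the select/propagate/weight scheme can be sketched in Python. The sketch assumes scalar states and takes the dynamic model and the measurement density as caller-supplied functions; the names `particle_filter_step`, `propagate`, `likelihood` and `mean_state` are hypothetical and are not taken from the implementation described in this paper.

```python
import random


def particle_filter_step(samples, weights, propagate, likelihood, rng=random):
    """One select/propagate/weight iteration over N samples.

    samples    -- list of states from the previous iteration
    weights    -- non-negative weights of those states (need not sum to 1)
    propagate  -- hypothetical callable standing in for s_t = A s_{t-1} + w_t
    likelihood -- hypothetical callable standing in for p(z_t | s_t)
    """
    n = len(samples)
    # Select: draw N indices with replacement, proportionally to the old weights.
    chosen = rng.choices(range(n), weights=weights, k=n)
    # Propagate every selected sample through the dynamic model.
    new_samples = [propagate(samples[k]) for k in chosen]
    # Weight every propagated sample with the measurement density.
    raw = [likelihood(s) for s in new_samples]
    total = sum(raw)
    if total <= 0.0:
        # Degenerate case (all likelihoods zero): fall back to uniform weights.
        new_weights = [1.0 / n] * n
    else:
        new_weights = [w / total for w in raw]
    return new_samples, new_weights


def mean_state(samples, weights):
    """Weighted mean of scalar states; the weights are assumed to sum to 1."""
    return sum(w * s for s, w in zip(samples, weights))
```

In this sketch the weights are normalized at the end of each step, so `mean_state` can be applied directly to the returned pair.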

3 Representation of the Target Appearance 

The head is one of the most easily recognizable parts of the human body and
can be reasonably well approximated by an ellipse. In [11] a vertically
oriented ellipse has been used to model the projection of a head in the image
plane. The intensity gradient near the edge of the ellipse and a color histogram
representing the interior were used to update the parameters of the ellipse over
time. Additionally, this method assumes that all pixels in the search area are
equally important. That tracking method does not work when the object being
tracked temporarily disappears from the camera view or changes shape
significantly between frames. In the method proposed here, an ellipse-based head
likelihood model is utilized, together with depth information, to find the
weights of particles during tracking. The model consists of the gradient along
the head boundary and a matching score between color histograms representing the
interiors of (i) an ellipse surrounding the tracked object and (ii) a currently
considered ellipse. Particle locations where the weights have large values are
then considered to be the most likely locations of the object of interest. The
particle set improves consistency of tracking by handling multiple peaks
representing hypotheses in the distribution.

Although the use of color discrimination is connected with some fundamental
problems such as the lack of robustness in varying illumination conditions,
color is perceived as a very useful discrimination cue because of its
computational efficiency and robustness against changes in target orientation.
Human skin color filtering has proven to be effective in several settings and has
been successfully applied in most of the face trackers relying primarily on color
[12], [13], [14], [15] or on color in conjunction with other relevant information
[16]. Color information is particularly useful for supporting the detection of
faces in image sequences because of its robustness towards changes in orientation
and scaling of the appearance of a moving object. The efficiency of color
segmentation techniques is especially worth emphasizing when a considered object
is occluded during tracking or is in shadow.

In our approach we use color histogram matching techniques to obtain
information about the possible location of the tracked target. The main idea of
such an approach is to compute a color distribution in the form of a color
histogram from the ellipse’s interior and to compare it with the histogram,
computed in the same manner, that represents the tracked object in the previous
iteration. The better a histogram representing the ellipse’s interior at a
specific particle position matches the reference histogram from the previous
iteration, the higher the probability that the tracked target is at the
considered candidate position. The outcome of the histogram matching, combined
with gradient information, is used to provide information about the expected
target location and is utilized during the weighting of particles.

In the context of head tracking on the basis of images from a mobile camera,
features which are invariant under head orientation are particularly useful.
In general, histograms are invariant to translation and rotation of the object
and they vary slowly with the change of angle of view and with the change in
scale. The histogram is constructed with a function
$h : \mathbb{R}^2 \to \{1 \ldots K\}$ which associates the color at location y
to the corresponding bin. A histogram representation can be obtained in a simple
way by quantizing the ellipse’s interior colors into K bins and counting the
number of times each discrete color occurs. Due to its statistical nature, a
color histogram can only reflect the content of images in a limited way and thus
the contents of the interiors of ellipses taken at small distances apart are
strongly correlated. If the number of bins K is too high, the histogram is noisy.
If K is too low, the density structure of the image representing the ellipse’s
interior is smoothed. Histogram-based techniques are effective only when K can be
kept relatively low and where sufficient amounts of data are available. Reducing
the number of bins makes the comparison between the histogram representing the
tracked head and the histogram of a candidate head faster. Additionally, such a
compact representation is tolerant to noise that can result from the imperfect
ellipse approximation of a highly deformable structure and from the curved
surface of a face causing significant variations of the observed colors. The
particle filter works well when the conditional densities $p(z_t \mid s_t)$ are
reasonably flat.

It can be demonstrated that with a change of lighting conditions the major
translation of the skin color distribution is along the lightness axis of the
RGB color space. Skin colors acquired from a static person tend to form tight
clusters in several color spaces, while colors acquired from moving persons form
wider clusters due to different reflecting surfaces. To make the histogram
representation of the tracked head less sensitive to lighting conditions the HSV
color space has been chosen; the V component has been represented by 4 bins
while the H and S components obtained an 8-bin representation.
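
A minimal sketch of such a quantization, using only Python's standard `colorsys` module, is given below. It assumes a joint histogram with 8 x 8 x 4 bins for H, S and V; the joint layout, the 0..255 input range and the names `hsv_bin` and `color_histogram` are assumptions made for illustration, since the text only states the number of bins per component.

```python
import colorsys

# Bin counts per component as stated in the text; the joint layout is assumed.
H_BINS, S_BINS, V_BINS = 8, 8, 4


def hsv_bin(r, g, b):
    """Map an RGB pixel (components in 0..255) to a joint HSV bin index."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # Each component lies in [0, 1]; clamp so that 1.0 falls into the last bin.
    hi = min(int(h * H_BINS), H_BINS - 1)
    si = min(int(s * S_BINS), S_BINS - 1)
    vi = min(int(v * V_BINS), V_BINS - 1)
    return (hi * S_BINS + si) * V_BINS + vi


def color_histogram(pixels):
    """Count how many of the given RGB pixels fall into each joint bin."""
    hist = [0] * (H_BINS * S_BINS * V_BINS)
    for r, g, b in pixels:
        hist[hsv_bin(r, g, b)] += 1
    return hist
```

Using only 4 bins for V makes the index coarser along the brightness axis than along hue and saturation, which is the intent expressed in the paragraph above.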

The histogram intersection technique [17] is a popular similarity measure between
two distributions represented by a pair of histograms I and M, each containing
L values. The intersection of the histograms is defined as follows:
$H = \sum_{u=1}^{K} \min(I^{(u)}, M^{(u)})$, where the terms $I^{(u)}$, $M^{(u)}$
represent the number of pixels inside the u-th bucket of the candidate histogram
in the current frame and of the histogram representing the tracked head in the
previous frame, respectively, whereas K is the total number of buckets. The
result of the intersection of two histograms is the number of pixels that have
the same color in both histograms.
To obtain a match value between zero and one the intersection is normalized
and the match value is determined as follows:
$H_\cap = H / \sum_{u=1}^{K} I^{(u)}$. The work [3] demonstrated that the metric
$\sqrt{1 - \rho(I, M)}$ derived from the Bhattacharyya coefficient $\rho$ is
invariant to the scale of the target and therefore is superior to other measures
such as histogram intersection or Kullback divergence. Considering discrete
densities, the coefficient is defined as follows:

$\rho(I, M) = \sum_{u=1}^{K} \sqrt{I^{(u)} M^{(u)}}$   (3)
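
The two similarity measures discussed above can be sketched as follows. This is an illustrative sketch only: normalizing the intersection by the pixel count of the candidate histogram is an assumption, the Bhattacharyya coefficient expects histograms that have already been normalized to sum to 1, and all function names are hypothetical.

```python
import math


def histogram_intersection(cand, ref):
    """Intersection of two histograms, normalized by the candidate's pixel count.

    The choice of the normalizing histogram is an assumption of this sketch.
    """
    overlap = sum(min(i, m) for i, m in zip(cand, ref))
    total = sum(cand)
    return overlap / total if total else 0.0


def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = float(sum(hist))
    return [h / total for h in hist] if total else list(hist)


def bhattacharyya_coefficient(p, q):
    """Sum over bins of sqrt(p_u * q_u) for normalized histograms p and q."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def bhattacharyya_distance(p, q):
    """Metric sqrt(1 - rho); max() guards against tiny negative rounding errors."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))
```

Identical normalized histograms give a coefficient of 1 and a distance of 0, while histograms with disjoint support give a coefficient of 0 and a distance of 1.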



Given the center of the target, a feature distribution including spatial informa- 
tion in color histogram can be calculated using a 2-dimensional kernel centered 
on the target center [18]. The kernel is used to provide the weight for a partic- 
ular feature according to its distance from the center of the kernel. In order to 
assign smaller weights to the pixels that are further away from the region center 
a nonnegative and monotonically decreasing function $k : [0, \infty) \to \mathbb{R}$
can be used [18]. The probability of a particular histogram bin u at location y
is calculated as

$d_y^{(u)} = C_a \sum_{i=1}^{L} k\left(\left\| \frac{y - y_i}{a} \right\|\right) \delta[h(y_i) - u]$   (4)

where $y_i$ are pixel locations of the face candidate, L is the number of pixels
in the region, $\delta$ is the Kronecker delta function and the constant a is the
radius of the kernel. The normalization factor

$C_a = \frac{1}{\sum_{i=1}^{L} k\left(\left\| \frac{y - y_i}{a} \right\|\right)}$

ensures that $\sum_{u=1}^{K} d_y^{(u)} = 1$. This normalization constant can be
precalculated [3] for the utilized kernel and assumed values of a. The
2-dimensional kernels have been prepared offline and then stored in lookup tables
for future use.
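
A kernel-weighted histogram of this kind can be sketched as below. The Epanechnikov-style profile is only one common choice of a nonnegative, monotonically decreasing function and is an assumption here, as are the names `epanechnikov`, `weighted_histogram` and `bin_of`; the sketch computes the kernel on the fly instead of reading it from a lookup table.

```python
def epanechnikov(r):
    """Assumed kernel profile: decreases from 1 at r = 0 to 0 at r >= 1."""
    return 1.0 - r * r if r < 1.0 else 0.0


def weighted_histogram(pixels, center, radius, bin_of, n_bins, kernel=epanechnikov):
    """Spatially weighted, normalized color histogram.

    pixels -- iterable of ((x, y), color) pairs inside the candidate region
    center -- (x, y) of the region center
    radius -- kernel radius a
    bin_of -- hypothetical function mapping a color to a bin in 0..n_bins-1
    """
    cx, cy = center
    hist = [0.0] * n_bins
    for (x, y), color in pixels:
        # Distance to the center, scaled by the kernel radius.
        r = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 / radius
        hist[bin_of(color)] += kernel(r)
    total = sum(hist)  # reciprocal of the normalization factor
    return [h / total for h in hist] if total > 0 else hist
```

Pixels at or beyond the kernel radius receive zero weight, so they do not influence the histogram at all.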

The length of the minor axis of a considered ellipse is determined on the basis
of depth information. In addition to the length resulting from the depth
information, we also considered a smaller and a larger projection scale of the
ellipse, so minor axes one pixel shorter and one pixel longer have been taken
into account as well. The length of the minor axis has been maintained by
performing a local search to maximize the goodness of the following match:
$w^* = \arg\max_{w_i \in W} \{G(w_i) H_s(w_i)\}$, where G and $H_s$ are
normalized scores based on intensity gradients and color histogram similarity.
In order to favor head candidates whose color distributions are similar to the
target color distribution we utilized Gaussian weighting with variance
$\sigma^2$ [7]

$H_s = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1 - \rho}{2\sigma^2}}$   (5)

where small Bhattacharyya distances correspond to large matching scores. The
search space W comprises the length of the ellipse’s minor axis obtained on the
basis of depth information as well as minor axes one pixel smaller and one pixel
larger.
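
The local search over the minor axis can be sketched as follows, assuming that the color score has the usual Gaussian form exp(-(1 - rho) / (2 sigma^2)) / (sqrt(2 pi) sigma) and that the gradient and color scores are available as functions of the axis length. The names `histogram_score`, `best_minor_axis`, `gradient_score` and `similarity` are hypothetical.

```python
import math


def histogram_score(rho, sigma):
    """Gaussian weighting of the squared Bhattacharyya distance 1 - rho."""
    return math.exp(-(1.0 - rho) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


def best_minor_axis(depth_axis, gradient_score, similarity):
    """Pick the axis length in W = {w - 1, w, w + 1} maximizing G(w) * H_s(w).

    depth_axis     -- minor axis length w derived from depth information
    gradient_score -- hypothetical callable returning G for an axis length
    similarity     -- hypothetical callable returning H_s for an axis length
    """
    candidates = [depth_axis - 1, depth_axis, depth_axis + 1]
    return max(candidates, key=lambda w: gradient_score(w) * similarity(w))
```

Because only three candidate lengths are evaluated per particle, the search adds little cost on top of the depth-based estimate.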

The samples are propagated on the basis of a dynamic model
$s_t = A s_{t-1} + w_t$, where A denotes a deterministic component describing
constant-velocity movement and $w_t$ is a multivariate Gaussian random variable.
The diffusion component represents uncertainty in prediction and therefore
provides a way of performing a local search about a state. The weight
$\pi_t^{(n)}$ of each hypothetical head region depends on the normalized
intensity gradients and the color histogram similarity which were obtained for
the minor axis length $w^*$.

The elliptical upright outlines with an assumed fixed aspect ratio equal to 1.4
have been prepared in the construction phase and stored for future use. For
each possible length of the minor axis we prepared off-line an elliptical outline
for computing the gradient and a kernel lookup table for including spatial
information in color histograms. Extending the algorithm to non-upright ellipses
is straightforward.

The histogram representing the tracked head has been adapted over time.
This makes it possible to track not only the face profile which was captured
during initialization of the tracker but also different profiles of the face as
well as the head. The update of the histogram has been realized on the basis of
the equation $M_t^{(u)} = (1 - \alpha) M_{t-1}^{(u)} + \alpha I_t^{(u)}$, where
$\alpha$ is an accommodation rate, $I_t$ represents the histogram of the interior
of the mean state ellipse, $M_{t-1}$ the histogram of the target from the
previous frame, whereas $u = 1, \ldots, K$.
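
The update rule is a per-bin exponential blend of the old reference histogram and the new observation. A minimal sketch, with the hypothetical name `adapt_histogram`, could look as follows.

```python
def adapt_histogram(reference, observed, alpha):
    """Blend per bin: (1 - alpha) * reference + alpha * observed."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * m + alpha * i for m, i in zip(reference, observed)]
```

With `alpha = 0` the reference histogram is left unchanged, and with `alpha = 1` it is replaced by the current observation.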

4 Tracking on the Basis of Moving Camera 

A kind of human-machine interaction which is useful in practice and well suited
to testing the robustness of a tracking algorithm is person following with a
mobile robot. In [19] a condensation-based algorithm is utilized to keep track
of multiple objects with a moving robot. The tracking experiments described in
this section were carried out with a Pioneer 2 DX mobile robot [20] equipped
with a commercial binocular Megapixel Stereo Head. In that system the dense
stereo maps are extracted by means of small-area correspondences between image
pairs [21], and therefore poor results are often obtained in regions of little
texture. The depth map covering a face region is usually dense because a human
face is rich in details and texture, see Fig. 1. Thanks to this property the
stereovision system provides a separate source of information and considerably
supports the process of approximating the tracked head with an ellipse.

Fig. 1. Depth images (frame 1 and frame 600)

A typical laptop computer equipped with a 2 GHz Pentium IV is utilized to
run the prepared visual tracker operating on 320x240 images. The position of
the tracked face in the image plane as well as the person’s distance to the
camera are written asynchronously to a block of common memory which can be easily
accessed by the Saphira client. Saphira is an integrated sensing and control
system architecture based on a client-server model whereby the robot supplies a
set of basic functions that can be used to interact with it [20]. Every 100
milliseconds the robot server sends a message packet containing information on
the velocity of the vehicle as well as sensor readings to the client. During
tracking, the control module keeps the user’s face within the camera field of
view by coordinating the rotation of the robot with the location of the tracked
face in the image plane. The aim of the robot orientation controller is to keep
the tracked face at a specific position in the image. The linear velocity depends
on the person’s distance to the camera. In the person-following experiments a
distance of 1.6 m has been assumed as the reference value that the linear
velocity controller should maintain. To eliminate needless robot rotations as
well as forward and backward movements we have applied simple logic providing
the necessary insensitivity zone. The PD controllers have been implemented in the
Saphira-interpreted Colbert language [20]. The tracking algorithm was implemented
in C/C++ and runs at a frame rate of about 10 Hz, depending on image
complexity.
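
The combination of a PD controller with an insensitivity zone can be illustrated by the following Python sketch. It is not the Colbert implementation used on the robot; the class name `PDController` and its parameters are hypothetical, and any gains or zone widths passed to it are placeholders rather than values from the experiments.

```python
class PDController:
    """PD controller whose input error is zeroed inside a dead zone."""

    def __init__(self, kp, kd, dead_zone):
        self.kp = kp                # proportional gain (placeholder value)
        self.kd = kd                # derivative gain (placeholder value)
        self.dead_zone = dead_zone  # half-width of the insensitivity zone
        self.prev_error = 0.0

    def update(self, error, dt):
        """Return the control output for the current error and time step dt."""
        # Inside the insensitivity zone the error is treated as zero, which
        # suppresses needless small corrections.
        if abs(error) < self.dead_zone:
            error = 0.0
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.kd * derivative
```

For the linear velocity the error would be the measured distance minus the 1.6 m reference, and `dt` would be on the order of the 100 ms message cycle mentioned above.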

We have undertaken experiments consisting in following a person facing the
camera within walking distance without losing the tracked face. Experiments in
which the mobile robot only rotates, which can be seen as analogous to
experiments with a pan camera, have also been conducted. In such experiments a
user moved about a room, walking back and forth as well as around the mobile
robot. The aim of such a scenario was to evaluate the quality of ellipse scaling
in response to the varying distance between the camera and the user during
person following. Our experimental findings show that thanks to stereovision
the ellipse is properly scaled and therefore, because of the appropriate head
approximation, sudden changes of the minor axis length as well as jumps of the
ellipse are largely eliminated. Figure 2 shows selected frames from the discussed
scenario, see also Fig. 1. The color of the door is very similar to that of a
human face and it can cause great difficulty for color-based tracking algorithms,
see also the image from frame 390 in Fig. 2. The region cue reflected by the
weighted color histogram varies slowly with slow translation of the target but
does not appropriately express the content of the image at reduced scale, see the
image from frame 600 in Fig. 2. The likelihood model combining gradient
information with a weighted histogram of the ellipse’s interior demonstrated the
ability to localize the target correctly in the case of reduced scale. The
gradient modality complements the color modality when the object is moving
because color information may become unreliable due to changes in the object pose
and illumination, whereas strong localization cues may be obtained from the
gradient information. The gradient information can therefore improve the
accommodation of the color model over time. In particular, the depth information
allows us to set the accommodation rate $\alpha$ to zero when the face is
localized beyond an assumed distance to the camera.

The depth map covering the face region is usually dense. Together with
skin-color and symmetry information as well as an eye template selected according
to the depth, this has allowed us to apply the eigenfaces method [22] and to
detect the presence of vertical, frontal-view faces in the scene very reliably,
and thus to initialize the tracker automatically. Thanks to the head position it
is possible to recognize some static commands on the basis of geometrical
relations between the face and hands and to interact with the mobile robot during
person following. Using the discussed system we have realized experiments in
which the robot has followed a person over distances exceeding 100 m without
losing the person. By dealing with multiple hypotheses this approach can track a
head reliably in cases of temporary occlusions and varying illumination
conditions.




Fig. 2. Face tracking relying only upon a rotation of the moving camera 
(frames 1,35,390,600) 



5 Evaluation Using PETS-ICVS Data Sets 

The experiments described in this section have been realized on the basis of
the PETS-ICVS data set, which was recorded in a smart meeting room. The aim of
the experiments was to track the meeting participants using a static color
camera. The images of size 576x720 have been converted to a size of 320x240 by
subsampling (selecting odd pixels in odd lines only) and bicubic image scaling.
Initialization of the tracker has been performed by searching for an elliptical
object in head-entry and head-exit zones determined in advance. A simple
background subtraction procedure executed in the above-mentioned boxes has proven
to be sufficient for person entry/exit detection. In this version of the tracker
a sample in the distribution represents an ellipse described by
$s = \{y, \dot{y}, l, \dot{l}\}$, where y denotes the location in the xy-image
plane, $\dot{y}$ the motion, l the length of the minor axis and $\dot{l}$ the
corresponding scale change.

Figure 3 depicts example frames from a typical experiment of Scenario C
viewed from Camera 1. Frame 13667 demonstrates the behavior of the tracker in
the case of a non-upright head orientation. Because only vertically oriented
ellipses have been assumed in advance, the tracker fitted an ellipse in a search
region in the proximity of the true location. Such behavior of the tracker has
been observed in 15 successive frames, after which the algorithm continued
smooth tracking of the head. In frames 13765, 13917 and 14140 we can perceive a
poor approximation of the head of the third person by an ellipse. Such an
undesirable effect has been observed only occasionally during processing of the
PETS data sets. The number of misfits can be greatly reduced by exploiting the
nearly constant distance of the tracked person to the camera and thus by
operating with a smaller range of lengths of the ellipse’s minor axis. The
experiments described in this section have been conducted using the relatively
large range of axis lengths which was needed during person following, namely from
6 to 30. A typical length of the ellipse’s axis for the frame range presented in
Fig. 3 is about 11. Another method of improving the robustness of the tracker
in situations where misfits have been observed is to combine it with a fast and
robust algorithm for detecting faces with out-of-plane rotation [23].




Fig. 3. Frames 11224, 13667, 13765, 13917, 14140, 14842 of Scenario C 



Figure 4 illustrates example frames of tracking on the basis of the CamShift
algorithm [2]. The tracker has been initialized in frame 10952, see the left
frame in Fig. 4, with the number of bins equal to 30, Smin=40 and Vmin=60.

Fig. 4. Face tracking using CamShift

6 Conclusion

We have presented a vision module that robustly detects and tracks a human
face. By employing color, stereovision and elliptical shape features the
proposed method can track a head against a dynamic background. The combination
of the above-mentioned cues with a particle filter seems to have considerable
prospects for applications in robotics and surveillance. The algorithm is robust
to sensor noise and uncertainty in localization. The resulting particle filter
was able to track the head reliably, even under varying lighting conditions.
Moreover, the particle filter performs satisfactorily even in the presence of
partial occlusions. To demonstrate that the system works correctly, we have
conducted several experiments in naturally occurring laboratory conditions. In
particular, the tracking module enables the robot to follow a person. Thanks to
the real-time robot control, the moving camera provides a considerably larger
search area for the vision system. Face tracking can be used not only for
directing the vision system’s attention to a user/intruder but also as a
prerequisite stage for face recognition and human action understanding. One of
the future research directions of the presented approach is to explore the
unscented particle filter
[7], [24]. One difficulty of utilizing the gradient along the head boundary is
the high nonlinearity of the observation likelihood: even a small difference in
the parameters of the ellipse can cause large changes in likelihood. The
unscented particle filter places the limited number of particles in an effective
way at a computational overhead comparable to that of the conventional particle
filtering scheme.

Abstract. We present an approach to parallel variational optical flow computation
on standard hardware by domain decomposition. Using an arbitrary partition of
the image plane into rectangular subdomains, the global solution to the
variational approach is obtained by iteratively combining local solutions which
can be efficiently computed in parallel by separate multi-grid iterations for
each subdomain. The approach is particularly suited for implementations on
PC-clusters because inter-process communication between subdomains (i.e.
processors) is minimized by restricting the exchange of data to a
lower-dimensional interface. By applying a dedicated interface preconditioner,
the necessary number of iterations between subdomains to achieve a fixed error
is bounded independently of the number of subdomains. Our approach provides a
major step towards real-time 2D image processing using off-the-shelf PC-hardware
and facilitates the efficient application of variational approaches to
large-scale image processing problems.



1 Introduction 

1.1 Overview and Motivation 

Motion estimation in terms of optical flow [3,25] is an important prerequisite
for many applications of computer vision including surveillance, robot
navigation, and dynamic event recognition. Since real-time computation is
required in many cases, much work has been done on parallel implementations of
local motion estimation schemes (differential-, correlation-, or phase-based
methods) [31,14].

In contrast to local estimation schemes, less work has been done on the
parallelization of non-local variational schemes for motion estimation, despite
considerable progress in recent years related to robustness, non-linear
regularization schemes, preservation of motion boundaries, and corresponding
successful applications [27,26,13,1,1,21,5,28,20,18,4,15,23,32,11,24]. It is
precisely the non-local nature of these approaches which on the one hand allows
structural constraints to be imposed on motion fields during estimation but, on
the other hand, hampers a straightforward parallel implementation by simply
partitioning the image domain into disjoint subdomains (see Figure 1).



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3024, pp. 205-216, 2004.

Fig. 1. Top: A partition of the image domain $\Omega$ into subdomains
$\{\Omega^i\}$ and a lower-dimensional interface $\Gamma$. Bottom, left: Motion
field estimated with a non-local variational approach. Bottom, center: Naive
parallelization by estimating motion independently in each subdomain leads to
boundary effects (a coarse 3x3 partition was used for better visibility). Bottom,
right: The $\ell_2$-error caused by naive parallelization as a grayvalue plot.
The global relative error is 11.3%. The local error near boundaries of
subdomains is much higher!



This motivates the investigation of computational approaches for the
parallelization of variational motion estimation by means of domain
decompositions as shown in Figure 1, top. Ideally, any of the approaches cited
above should be applicable independently in each subdomain. In addition, however,
a mechanism is needed which fits together the subdomain solutions so as to yield
the same global solution which is obtained when applying the respective approach
in the usual non-parallel way to the entire image domain $\Omega$. The
investigation of such a scheme is the objective of this paper.

1.2 Contribution and Organization 

We present an approach to the parallelization of variational optical flow
computation which fulfils the following requirements:

(i) Suitability for the implementation on PC-clusters through the minimization of
inter-process communication,

(ii) use of a mathematical framework which allows for applications to a large
class of variational models.

In order to meet requirement (ii), our approach draws upon the general
mathematical theory on domain decomposition in connection with the solution of
partial differential equations [9,30,29]. Requirement (i) addresses a major
source for degrading the performance of message-passing based parallel
architectures. To this end, we focus on the subclass of substructuring methods
because inter-process communication is minimized by restricting the exchange of
data to a lower-dimensional interface between the subdomains.

After sketching a prototypical variational approach to motion estimation in
section 2, we develop a corresponding domain decomposition framework in section
3. In section 4, we describe features of our parallel implementation and the
crucial design of an interface preconditioner, and report the influence of both
preconditioning and the number of subdomains on the convergence rate. Finally, we
report measurements for experiments on a PC-cluster with nodes in section 5.



2 Variational Motion Estimation 

In this section, we sketch the prototypical variational approach of Horn and
Schunck [19] and its discretization as a basis for domain decomposition. Note
that our formulation sufficiently abstracts from this particular approach. As a
consequence, the framework developed in section 3 can be applied to more general
variational approaches to motion estimation, as discussed in section 1.1.



2.1 Variational Problem 

Throughout this section, $g(x) = g(x_1, x_2)$ is the grayvalue function,
$\nabla = (\partial_{x_1}, \partial_{x_2})^\top$ denotes the gradient with
respect to spatial variables, $\partial_t$ the partial derivative with respect
to time, and $u = (u_1, u_2)^\top$, $v = (v_1, v_2)^\top$ denote vector fields in
the linear space $V = H^1(\Omega) \times H^1(\Omega)$.

With this notational convention, the variational problem to be solved reads [19]:

$J(u) = \inf_{v \in V} \int_\Omega \left\{ (\nabla g \cdot v + \partial_t g)^2 + \lambda \left( |\nabla v_1|^2 + |\nabla v_2|^2 \right) \right\} dx$   (1)

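Vanishing of the first variation of the functional J in (1) yields the
variational equation:

$a(u, v) = f(v), \quad \forall v \in V$,   (2)

where

$a(u, v) = \int_\Omega \left\{ (\nabla g \cdot u)(\nabla g \cdot v) + \lambda (\nabla u_1 \cdot \nabla v_1 + \nabla u_2 \cdot \nabla v_2) \right\} dx$   (3)

$f(v) = -\int_\Omega \partial_t g \, \nabla g \cdot v \, dx$   (4)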



Under weak conditions with respect to the image data g (i.e. $\partial_{x_1} g$
and $\partial_{x_2} g$ have to be independent in the $L^2$-sense), there exists a
constant c > 0 such that:

$a(v, v) \geq c \|v\|_V^2, \quad \forall v \in V$   (5)

As a consequence, J in (1) is strictly convex and its global minimum u is the
unique solution to the variational equation (2). Partially integrating in (2), we
derive the system of Euler-Lagrange equations:

$Lu = f$ in $\Omega$, $\quad \partial_n u = 0$ on $\partial\Omega$,   (6)

where

$Lu = -\lambda \Delta u + (\nabla g \cdot u) \nabla g$



2.2 Discretization 

To approximate the vector field u numerically, equation (2) is discretized by
piecewise linear finite elements over the triangulated section $\Omega$ of the
image plane [10]. We arrange the vectors of nodal variables $u_1, u_2$
corresponding to the finite element discretizations of $u_1(x), u_2(x)$ as
follows¹: $u = (u_1^\top, u_2^\top)^\top$. Taking into consideration the symmetry
of the bilinear form (3), this induces the following block structure of the
discretized version $Au = f$ of (2):

$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \begin{pmatrix} f_1 \\ f_2 \end{pmatrix}$,   (7)

where $\forall i, j = 1, \ldots, N$:

$(A_{11})_{ij} = a((\phi_i, 0)^\top, (\phi_j, 0)^\top)$
$(A_{12})_{ij} = a((\phi_i, 0)^\top, (0, \phi_j)^\top)$
$(A_{21})_{ij} = (A_{12})_{ji}$
$(A_{22})_{ij} = a((0, \phi_i)^\top, (0, \phi_j)^\top)$
$(f_1)_i = f((\phi_i, 0)^\top)$
$(f_2)_i = f((0, \phi_i)^\top)$

Here, $\phi_k$ denotes the linear basis function corresponding to the nodal
variable $(u_1)_k$ or $(u_2)_k$, respectively.



3 Domain Decomposition 



3.1 Two Subdomains 

Let $\Omega^1 \cup \Omega^2$ be a partition of $\Omega$ with a common boundary
$\Gamma = \overline{\Omega}^1 \cap \overline{\Omega}^2$. We denote the
corresponding function spaces with $V^1$, $V^2$. In the following, superscripts
refer to subdomains.

We wish to represent u from (6) by two functions $u^1 \in V^1$, $u^2 \in V^2$
which are computed by solving two related problems in $\Omega^1$, $\Omega^2$,
respectively. The relation:

$u(x) = \begin{cases} u^1(x) & x \in \Omega^1 \\ u^2(x) & x \in \Omega^2 \end{cases}$   (8)

obviously holds iff the following is true:

$Lu^1 = f^1$ in $\Omega^1$, $\quad \partial_{n^1} u^1 = 0$ on $\partial\Omega^1 \cap \partial\Omega$   (9)

$Lu^2 = f^2$ in $\Omega^2$, $\quad \partial_{n^2} u^2 = 0$ on $\partial\Omega^2 \cap \partial\Omega$   (10)

$u^1 = u^2$ on $\Gamma$   (11)

$\partial_{n^1} u^1 = -\partial_{n^2} u^2$ on $\Gamma$   (12)

¹ With slight abuse of notation, we use the same symbols $u_1, u_2$, for simplicity.



We observe that (6) cannot simply be solved by separately computing $u^1$ and
$u^2$ in each domain $\Omega^1$, $\Omega^2$ (see also Fig. 1!) because the
natural boundary conditions have to be changed on $\Gamma$, due to (11) and (12).

As a consequence, in order to solve the system of equations (9)-(12), we equate
the restriction to the interface $\Gamma$ of the two solutions $u^1$, $u^2$ to
(9) and (10), due to equation (11), $u_\Gamma := u^1|_\Gamma = u^2|_\Gamma$, and
substitute $u_\Gamma$ into equation (12) (see Eqn. (19) below).



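To this end, we solve

$Lu = f$ in $\Omega$, $\quad \partial_n u = 0$ on $\partial\Omega \setminus \Gamma$, $\quad u = u_\Gamma$ on $\Gamma$   (13)

and decompose u into two functions,

$u = u_0 + u_f$,

which are the unique solutions to the following problems:

$Lu_0 = 0$ in $\Omega$, $\quad \partial_n u_0 = 0$ on $\partial\Omega \setminus \Gamma$, $\quad u_0 = u_\Gamma$ on $\Gamma$   (14)

$Lu_f = f$ in $\Omega$, $\quad \partial_n u_f = 0$ on $\partial\Omega \setminus \Gamma$, $\quad u_f = 0$ on $\Gamma$   (15)

Clearly, the restriction of $u_f$ to the interface $\Gamma$ is zero,
$u_f|_\Gamma = 0$, and:

$u|_\Gamma = u_0|_\Gamma$   (16)

$\partial_n u = \partial_n u_0 + \partial_n u_f$   (17)

The definition of the Steklov-Poincaré operator S is:

$S : u_\Gamma \mapsto \partial_n u_0|_\Gamma$   (18)

Applying this mapping to the solutions $u^1$, $u^2$ of equations (9) and (10) in
the domains $\Omega^1$ and $\Omega^2$, respectively, equation (12) becomes with
$u_\Gamma = u^1|_\Gamma = u^2|_\Gamma$ due to (11):

$(S^1 + S^2) u_\Gamma + \partial_{n^1} u_f^1|_\Gamma + \partial_{n^2} u_f^2|_\Gamma = 0$   (19)

It remains to discretize this equation in order to solve for $u_\Gamma$. This
will be done in the following section.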



3.2 Discretizing the Interface Equation 

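By virtue of definitions (14) and (15) and standard results [2], we obtain in
each domain the respective linear systems:

$\begin{pmatrix} A_{II}^i & A_{I\Gamma}^i \\ A_{\Gamma I}^i & A_{\Gamma\Gamma}^i \end{pmatrix} \begin{pmatrix} u_{0,I}^i \\ u_\Gamma^i \end{pmatrix} = \begin{pmatrix} 0 \\ \partial_{n^i} u_0^i|_\Gamma \end{pmatrix}, \quad i = 1, 2$   (20)

and

$\begin{pmatrix} A_{II}^i & A_{I\Gamma}^i \\ A_{\Gamma I}^i & A_{\Gamma\Gamma}^i \end{pmatrix} \begin{pmatrix} u_{f,I}^i \\ 0 \end{pmatrix} = \begin{pmatrix} f_I^i \\ f_\Gamma^i + \partial_{n^i} u_f^i|_\Gamma \end{pmatrix}, \quad i = 1, 2$   (21)

Due to the system (9)-(12), we have to solve simultaneously (20) and (21) in both
domains. Since $u^i = u_0^i + u_f^i$, summation of (20) and (21) for each domain,
respectively, gives:

$\begin{pmatrix} A_{II}^i & A_{I\Gamma}^i \\ A_{\Gamma I}^i & A_{\Gamma\Gamma}^i \end{pmatrix} \begin{pmatrix} u_I^i \\ u_\Gamma^i \end{pmatrix} = \begin{pmatrix} f_I^i \\ f_\Gamma^i + \partial_{n^i} u^i|_\Gamma \end{pmatrix}, \quad i = 1, 2$

We combine these equations into a single system:

$\begin{pmatrix} A_{II}^1 & 0 & A_{I\Gamma}^1 \\ 0 & A_{II}^2 & A_{I\Gamma}^2 \\ A_{\Gamma I}^1 & A_{\Gamma I}^2 & A_{\Gamma\Gamma}^1 + A_{\Gamma\Gamma}^2 \end{pmatrix} \begin{pmatrix} u_I^1 \\ u_I^2 \\ u_\Gamma \end{pmatrix} = \begin{pmatrix} f_I^1 \\ f_I^2 \\ f_\Gamma \end{pmatrix}$   (22)

where $f_\Gamma = f_\Gamma^1 + f_\Gamma^2$.

By solving the first two equations for $u_I^1$, $u_I^2$ and substitution into the
third equation of (22), we conclude that (12) holds iff:

$A_{\Gamma I}^1 (A_{II}^1)^{-1} (f_I^1 - A_{I\Gamma}^1 u_\Gamma) + A_{\Gamma I}^2 (A_{II}^2)^{-1} (f_I^2 - A_{I\Gamma}^2 u_\Gamma) + (A_{\Gamma\Gamma}^1 + A_{\Gamma\Gamma}^2) u_\Gamma = f_\Gamma$   (23)

This is just the discretized counterpart of (19):

$(S^1 + S^2) u_\Gamma = f_\Gamma - A_{\Gamma I}^1 (A_{II}^1)^{-1} f_I^1 - A_{\Gamma I}^2 (A_{II}^2)^{-1} f_I^2$   (24)

$S^i u_\Gamma = (A_{\Gamma\Gamma}^i - A_{\Gamma I}^i (A_{II}^i)^{-1} A_{I\Gamma}^i) u_\Gamma, \quad i = 1, 2$   (25)

Once $u_\Gamma$ is computed by solving (24), $u^1$ and $u^2$ follow from (9),
(10) with boundary values $u_\Gamma$ on the common interface $\Gamma$.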
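
The elimination of the interior unknowns can be illustrated on a small dense system. The following Python sketch is not the multi-grid based implementation of this paper: it assumes a matrix in which the two interior index sets are not coupled to each other, assembles the interface (Schur complement) equation, and recovers the interior values by two independent local solves. The helper names `solve`, `matvec`, `block` and `interface_solve` are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def block(a, rows, cols):
    return [[a[r][c] for c in cols] for r in rows]


def interface_solve(a, f, i1, i2, g):
    """Solve a u = f via the interface equation; i1, i2 interior, g interface."""
    s = block(a, g, g)          # starts as the summed interface block
    rhs = [f[k] for k in g]     # starts as the interface right-hand side
    for idx in (i1, i2):        # the two loop bodies are independent of each other
        a_ii, a_ig, a_gi = block(a, idx, idx), block(a, idx, g), block(a, g, idx)
        # Subtract A_GI (A_II)^-1 A_IG column by column (one local solve each).
        for j in range(len(g)):
            col = solve(a_ii, [row[j] for row in a_ig])
            for r, val in enumerate(matvec(a_gi, col)):
                s[r][j] -= val
        # Subtract A_GI (A_II)^-1 f_I from the right-hand side.
        w = solve(a_ii, [f[k] for k in idx])
        rhs = [x - y for x, y in zip(rhs, matvec(a_gi, w))]
    u_g = solve(s, rhs)
    u = [0.0] * len(f)
    for k, val in zip(g, u_g):
        u[k] = val
    for idx in (i1, i2):        # back-substitution with known interface values
        a_ii, a_ig = block(a, idx, idx), block(a, idx, g)
        local_rhs = [f[k] - c for k, c in zip(idx, matvec(a_ig, u_g))]
        for k, val in zip(idx, solve(a_ii, local_rhs)):
            u[k] = val
    return u
```

For a tridiagonal test matrix with interior sets {0, 1} and {3, 4} and interface {2}, the result agrees with a direct solve of the full system, which is the property expressed by equations (22)-(25).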



3.3 Multiple Domains 

Let $R^i$ denote the restriction of the vector of nodal variables $u_\Gamma$ on
the interface $\Gamma$ to those on $\overline{\Omega}^i \cap \Gamma$.
Analogously to the case of two domains detailed above, the interface equation
(19) in the case of multiple domains reads:

$\left( \sum_i (R^i)^\top S^i R^i \right) u_\Gamma = f_\Gamma - \sum_i (R^i)^\top A_{\Gamma I}^i (A_{II}^i)^{-1} f_I^i$   (26)

Note that all computations restricted to subdomains $\Omega^i$ can be done in
parallel!

4 Preconditioning, Parallel Implementation, and Convergence Rates



4.1 Interface Preconditioner 



While a fine partition of $\Omega$ into a large number of subdomains $\Omega^i$ leads to
small-sized and “computationally cheap” local problems in each subdomain, the condition
number of the Steklov-Poincaré operator $S$ deteriorates progressively [29]. As a
consequence, preconditioning of the interface equation becomes crucial for an efficient
parallel implementation.

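Among different but provably optimal (“spectrally equivalent”) families of
preconditioners (cf. [9,30]), we examined in particular the
Balancing-Neumann-Neumann-preconditioner (BNN) [22,12]:

$P_{BNN}^{-1} := \left( I - (R^0)^\top (S^0)^{-1} R^0 S \right) P_{NN}^{-1} \left( I - S (R^0)^\top (S^0)^{-1} R^0 \right) + (R^0)^\top (S^0)^{-1} R^0$ , (27)

where

$P_{NN}^{-1} := D \left( \sum_i (R^i)^\top (S^i)^{-1} R^i \right) D$ (28)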



This preconditioner, applied in connection with conjugate gradient iteration [17],
preserves the symmetry of $S$ in (18) and naturally extends to more general problems
related to three-dimensional image sequences or unstructured geometries and/or
triangulations.

Preconditioner (27) carries out a correction step (denoted as “balancing” in the
literature) before and after the application of the Neumann-Neumann-preconditioner (NN)
(28) on a coarse grid given by the partition of the domain $\Omega$ into subdomains (see
Figure 1).

The restriction operator $R^0$ sums up the weighted values on the boundary of each
subdomain, where the weights are given by the inverse of the number of subdomains
sharing each particular node, i.e.

$R^0_{ji} := \begin{cases} \frac{1}{2} & \text{node } i \text{ is on an edge of } \Omega_j \\ \frac{1}{4} & \text{node } i \text{ is in a vertex of } \Omega_j \\ 0 & \text{else} \end{cases}$ (29)

Then, $S^0$ is defined by

$S^0 := R^0 S (R^0)^\top .$ (30)

Note that $S^0$ is a dense matrix of small dimension (related to the number of
subdomains) which can be efficiently inverted by a standard direct method.
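The weighting rule behind the restriction operator can be sketched as follows. This is only an illustration under stated assumptions: a square domain split into a regular grid of equally sized square subdomains, nodes at integer coordinates, and a hypothetical helper name `sharing_weights`. Nodes where a partition line meets the image boundary are counted by the same rule, which is an assumption of this sketch and is not spelled out in the text.

```python
def sharing_weights(n_sub, H):
    """Weights 1 / (number of subdomains sharing a node) on the interface.

    The domain consists of n_sub x n_sub subdomains of H x H cells each, so
    nodes have integer coordinates 0 .. n_sub * H. Returns a dictionary
    {(x, y): weight} for all nodes shared by at least two subdomains.
    """
    size = n_sub * H
    weights = {}
    for x in range(size + 1):
        for y in range(size + 1):
            # Along each axis a node belongs to two subdomain closures if it
            # lies on an inner partition line, and to one otherwise.
            cx = 2 if x % H == 0 and 0 < x < size else 1
            cy = 2 if y % H == 0 and 0 < y < size else 1
            count = cx * cy
            if count > 1:
                weights[(x, y)] = 1.0 / count
    return weights


# Example: 2 x 2 subdomains of 4 x 4 cells each.
w = sharing_weights(2, 4)
```

In this example the cross point of the two partition lines receives the weight 1/4 and the remaining interface nodes the weight 1/2, in line with the two non-zero cases of (29).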



4.2 Parallel Implementation 

The preconditioned conjugate gradient iteration for solving the interface equations (19)
and (26) provides several starting points for parallel computation. In the case of the
NN-preconditioner (28), the calculation of $(S^i)^{-1}$, which is done indirectly by
calculating $(A^i)^{-1}$ on the whole subdomain, can be carried out in parallel. Then,
the restriction matrices $R^i$ and $(R^i)^\top$ amount to scatter and gather operations
from the point of view of the central process. Furthermore, when calculating $S$ during a
conjugate gradient iteration, parallelization can be employed as well. Since $S$ is
already written in decomposed form in (24) and (26), the procedure is analogous to that
for $P_{NN}^{-1}$, with the main difference (besides leaving out the weighting by $D$)
that here the Dirichlet system (25) has to be solved on each subdomain in order to
calculate the action of $S^i$. Both parallelization procedures are combined in the
calculation of the BNN-preconditioner (27), since both operators are involved there.

The inversion of the coarse operator $S^0$ does not provide any possibilities for
parallelization, since it has proven most practical to calculate $S^0$ numerically in
advance, by carrying out (30), and then to compute $(S^0)^{-1}$ in the central process,
again using a conjugate gradient method. Since the coarse system is much smaller (the
grid is equivalent to the subdomain partition), the computation time for this inversion
has proven to be very small and can be neglected in practice. Furthermore, the initial
computation of the right hand side in (19) or (26) can be parallelized in an analogous
manner.
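The outer iteration can be sketched generically as follows. This is a textbook preconditioned conjugate gradient loop, not the authors' code: the operator and the preconditioner are passed as callables (`apply_S` and `apply_P_inv` are hypothetical names), so that each application could internally scatter work to the subdomains and gather the results. In the usage lines a simple diagonal (Jacobi) preconditioner stands in for the NN and BNN preconditioners.

```python
def dot(a, b):
    """Euclidean inner product of two vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def pcg(apply_S, apply_P_inv, b, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradients for S x = b with S symmetric
    positive definite. Returns the solution and the iteration count."""
    x = [0.0] * len(b)
    r = list(b)                  # residual b - S x for the start vector x = 0
    z = apply_P_inv(r)           # preconditioned residual
    p = list(z)                  # first search direction
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for k in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k
        Sp = apply_S(p)          # the only place where S is applied
        alpha = rz / dot(p, Sp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, Sp)]
        z = apply_P_inv(r)       # the only place where P^{-1} is applied
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Usage on a small symmetric positive definite system.
S = [[4.0, 1.0], [1.0, 3.0]]
x, iterations = pcg(lambda v: [dot(row, v) for row in S],
                    lambda v: [vi / S[i][i] for i, vi in enumerate(v)],
                    [1.0, 2.0])
```

Each iteration applies the operator and the preconditioner exactly once, so these two calls are the natural places for the parallel subdomain solves described above.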

4.3 Multi-Grid Subdomain Solver 

Evaluation of the right hand side in (26) as well as of $S^i$ and $(S^i)^{-1}$ (cf.
(25)) requires, in parallel for each domain (i.e. processor), the fast solution of the
corresponding Dirichlet and Neumann boundary value problems, respectively. To this end,
we implemented a multi-grid solver. Since domain decomposition methods depend strongly
on the performance and accuracy of their internal solver, we considered the use of
multi-grid methods [6,16]. These methods are well-known to be among the fastest and most
accurate numerical schemes for the solution of systems of linear equations.

In [7,8] such a multi-grid scheme has been proposed for the CLG approach. It 
allowed the computation of up to 40 flow fields for sequences of size 200 x 200 within a 
single second. Obviously, an integration of this strategy into our domain decomposition 
framework seems desirable. In fact, the only difference between the single and the 
multiple domain case is the possible cooccurrence of Neumann and Dirichlet boundary 
conditions in the same subdomain. Once this is taken into account, the implementation 
is straightforward. 

Let us now sketch some technical details of our multi-grid solver. Our strategy is
based on the pointwise coupled Gauß-Seidel method, which is hierarchically applied in
the form of a so-called full multi-grid strategy. An example of such a full multi-grid
scheme is given in Fig. 2. It shows how a coarse version of the original problem is
refined step by step and how correcting multi-grid methods, e.g. W-cycles, are used at
each refinement level for solving. In this context, we chose W-cycles that perform two
pointwise coupled Gauß-Seidel iterations in both their pre- and postsmoothing
relaxation steps. Besides the traversing strategy, operators have to be defined that
handle the information transfer between the grids. For restriction (fine-to-coarse)
simple averaging is used, while prolongation (coarse-to-fine) is performed by means of
bilinear interpolation. Finally, coarser versions of the matrix operator have to be
created. To this end we rediscretised the Euler-Lagrange equations. Such a procedure is
called discretisation coarse grid approximation (DCA).
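The ingredients listed above can be sketched on a much simpler model problem. The code below is an illustration under several assumptions, not the solver of [7,8]: it treats the scalar 1D equation -u'' = f with zero boundary values instead of the coupled 2D system, uses a three-point weighted average for restriction and linear interpolation for prolongation, and omits the cascading (full multi-grid) part. All function names are hypothetical. It does follow the text in using Gauß-Seidel smoothing with two pre- and two post-relaxation sweeps, a W-cycle, and a rediscretised coarse operator.

```python
def gauss_seidel(u, f, h, sweeps):
    """In-place Gauss-Seidel sweeps for -u'' = f; u[0] and u[-1] stay fixed."""
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def residual(u, f, h):
    """Residual f - A u of the three-point discretisation (zero at the ends)."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def restrict(r):
    """Fine-to-coarse transfer by weighted averaging of neighbouring values."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for j in range(1, nc):
        rc[j] = 0.25 * r[2 * j - 1] + 0.5 * r[2 * j] + 0.25 * r[2 * j + 1]
    return rc


def prolong(ec):
    """Coarse-to-fine transfer by linear interpolation."""
    e = [0.0] * (2 * (len(ec) - 1) + 1)
    for j in range(len(ec)):
        e[2 * j] = ec[j]
    for j in range(len(ec) - 1):
        e[2 * j + 1] = 0.5 * (ec[j] + ec[j + 1])
    return e


def w_cycle(u, f, h, gamma=2, sweeps=2):
    """One correcting multi-grid cycle on a grid with len(u) - 1 = 2^k cells.

    gamma = 2 visits the coarse grid twice per level (W-cycle); gamma = 1
    would give a V-cycle.
    """
    n = len(u) - 1
    if n == 2:
        # Coarsest grid: one unknown, solved exactly.
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return u
    gauss_seidel(u, f, h, sweeps)             # pre-smoothing
    rc = restrict(residual(u, f, h))          # residual equation, coarse grid
    ec = [0.0] * len(rc)
    for _ in range(gamma):
        w_cycle(ec, rc, 2.0 * h, gamma, sweeps)   # rediscretised operator (DCA)
    e = prolong(ec)
    for i in range(1, n):
        u[i] += e[i]                          # coarse grid correction
    gauss_seidel(u, f, h, sweeps)             # post-smoothing
    return u
```

Calling `w_cycle` repeatedly on the same `u` reduces the residual further with every cycle; the coarse operator is obtained simply by calling the same routines with the doubled mesh size, which is the DCA idea in its simplest form.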
Fig. 2. Example of a full multi-grid implementation for four levels taken from [8]. Vertical solid
lines separate alternating blocks of the two basic strategies: cascading and correcting multi-grid.
Blocks belonging to the cascading multi-grid strategy are marked with c. Starting from a coarse
scale the original problem is refined step by step. This is visualised by the → symbol. Thereby
the coarser solution serves as an initial approximation for the refined problem. At each refinement
level, a correcting multi-grid solver is used in the form of two W-cycles (marked with w). Performing
iterations on the original equation is marked with large black dots, while iterations on residual
equations are marked with smaller ones.



4.4 Convergence Rates 

In this section, we examine the influence of both the number of subdomains and the 
coarse-grid correction step on the convergence rate of the preconditioned conjugate 
gradient iteration. 

We first measured the number of outer iterations to reach a relative residual error 
||ttp — U 7 -||s/||{tr||s < 10“^ (ur- exact solution; k: iteration index) of equation (26) for 
different numbers of subdomains. The results are depicted in Figure 3 (1). They clearly
show that the computational costs using the non-balancing preconditioner grow with
the number of subdomains, whereas they remain nearly constant for the preconditioner
involving a coarse grid correction step (we neglected the time needed for solving the
coarse small-sized system related to $S^0$). These results are confirmed by Figure 3 (2),
where the residual error for a fixed number of outer PCG-iterations is shown. Thus,
the BNN-preconditioner is much closer to an optimal preconditioner, making the
convergence rate independent of both the pixel meshsize $h$ and the coarse meshsize $H$
determined by the number of subdomains. It also follows that the solver using this
preconditioner scales much better with the number of subdomains, since the reduced costs
for the local problems associated with $S^i$ by far compensate for the additional costs
of solving the coarse-grid system and of process communication.



5 Cluster Computing 

The algorithm, as described in sections 4.2 and 4.3, was implemented in C/C++ on a
Linux operating system. An implementation of the Message Passing Interface (MPI) 
was used for parallelization. Benchmarking was conducted by the MPE library included 
in the MPI package. All experiments have been carried out on a dedicated PC-cluster 
“HELICS” at the University of Heidelberg, which comprises 256 Dual-Athlon MP 
[Fig. 3 plots: (1) number of outer iterations; (2) residual error.]

Fig. 3. Effect of coarse-grid correction. (1) Number of outer PCG-iterations until the residual 
error of system (26) is reduced to \\ui^ — m||s/||uo ~ w ||5 < 10“® for coarse mesh sizes H G 
{126, 84, 63, 42, 28, 21, 18, 14, 12} on a 252 x 252 image using the NN-preconditioner (solid 
line) and the BNN-preconditioner (stippled line). The corresponding numbers of subdomains are 
$\{2^2, 3^2, 4^2, 6^2, 9^2, 12^2, 14^2, 18^2, 21^2\}$. The local systems were solved in each subdomain to a
residual error of 10“®. (2) Relative residual error after 10 outer PCG-iterations using the NN- 
preconditioner (solid line) and the BNN-preconditioner (stippled line) for the same set of coarse 
mesh sizes. The results clearly show the favourable influence of the coarse grid coupling leading 
to a nearly constant error and therefore to a nearly constant number of outer PCG-iterations 
when using the balancing preconditioner on 4 x 4 subdomains and above. Hence, the balancing 
preconditioner is much closer to an optimal preconditioner making the convergence rate nearly 
independent of h and H. 

1.4 GHz nodes (i.e. 512 processors in total) connected by the interconnect network 
Myrinet2000. 

As input data, an image pair from a real air-flow sequence provided by the Onera
Labs, Rennes, France, has been taken. The reference solution $x_{ref}$ was calculated
without parallelization by the use of the multi-grid solver (full multi-grid, 5 W-cycles,
3 pre- and post-relaxation steps per level) and regularization parameter $\lambda = 0.05$
(the intensity values of the input images were normalized to [0, 1]). The objective was
to compare the total computation time of the non-parallel solving (by multi-grid) on the
whole image
Table 1. Computation times for different partitions. Compared to a dedicated one-processor
multi-grid implementation, domain-decomposition accelerates the computation for 5x5
processors and above. The speed-up for 7x7 compared to the computation time on one
processor is nearly 40 %.

Partition (h./v.)   Image size (pixels)   Outer iter.   Run time   Comm. time
2x2                 512^2                 1             1550 ms    4%
3x3                 511^2                 1             960 ms     5.6%
5x5                 513^2                 3             664 ms     10%
6x6                 511^2                 4             593 ms     10%
7x7                 512^2                 4             516 ms     11%


plane on one machine to the total computation time of the parallel solving on N x N
processors by using Neumann-Neumann preconditioning. Computation was stopped when
a relative error of 1% had been reached, i.e. $\|x^i - x_{ref}\|_2 / \|x_{ref}\|_2 < 0.01$,
where $x^i$ is the solution after $i$ conjugate gradient iterations.

Comparison to Single-Processor Multi-grid. The time for calculating the vector field
without parallelization on one processor to the given accuracy was 721 ms (full
multi-grid, 2 W-cycles and 1 pre- and post-relaxation per level). Table 1 shows the
computation times of the parallel substructuring method for different partitions. The
parameters of the local multi-grid solver were optimized by hand to minimize the
total computation time. The results show that parallelization by the use of
Neumann-Neumann preconditioning starts to improve the computation time from 5x5
processors and above. The speed-up for 7 x 7 compared to the computation time on one
processor is nearly 40%.

Similar experiments for Balancing Neumann-Neumann preconditioning will be conducted in
the future.
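The reported figures can be checked with a few lines of arithmetic. The sketch below only restates the run times quoted in the text and in Table 1; reading the "nearly 40%" as the ratio of the single-processor time to the parallel time is an interpretation made here, not a statement from the source.

```python
# Run time on one processor and per partition, in milliseconds (Table 1).
single_ms = 721.0
parallel_ms = {"2x2": 1550.0, "3x3": 960.0, "5x5": 664.0,
               "6x6": 593.0, "7x7": 516.0}

# Speed-up factor: values above 1 mean the parallel run is faster.
speedup = {part: single_ms / t for part, t in parallel_ms.items()}

for part, s in speedup.items():
    print(f"{part}: factor {s:.2f} ({(s - 1.0) * 100.0:+.0f}%)")
```

Under this reading the factor first exceeds 1 for the 5x5 partition and is roughly 1.40 for 7x7, which agrees with the statements in the text.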
Abstract. We consider the problem of image comparison in order to
match smooth surfaces under varying illumination. In a smooth surface 
nearby surface normals are highly correlated. We model such surfaces 
as Gaussian processes and derive the resulting statistical characteriza- 
tion of the corresponding images. Supported by this model, we treat 
the difference between two images, associated with the same surface and 
different lighting, as colored Gaussian noise, and use the whitening tool 
from signal detection theory to construct a measure of difference between 
such images. This also improves comparisons by accentuating the differ- 
ences between images of different surfaces. At the same time, we prove 
that no linear filter, including ours, can produce lighting insensitive im- 
age comparisons. While our Gaussian assumption is a simplification, the 
resulting measure functions well for both synthetic and real smooth ob- 
jects. Thus we improve upon methods for matching images of smooth 
objects, while providing insight into the performance of such methods. 
Much prior work has focused on image comparison methods appropriate 
for highly curved surfaces. We combine our method with one of these, 
and demonstrate high performance on rough and smooth objects. 



1 Introduction 

Comparing images is a fundamental part of computer vision systems that per- 
form recognition, alignment and tracking. Many approaches have tackled the 
critical problem of accounting for lighting variations [6,11,13,1,3] when making 
comparisons. These methods work well on rough objects containing discontinu- 
ities or places of rapid change in albedo or shape. However, comparing images of 
smooth surfaces with no edges or texture under varying illumination remains a 
challenging problem. This problem is important since most real surfaces contain 
rough and smooth regions. Handling smooth regions is important for improved 
recognition or dense registration or tracking of such objects. In this paper we 
propose a new measure for image comparison of smooth surfaces, and demon- 
strate its value on the problem of object identification under fixed pose but 
varying lighting. 
There are three things that seem to be very important in constructing a
representation for image comparison. First, finding a representation that captures
similarities between images of the same object (e.g., through quasi-invariance).
Second, also capturing dissimilarity between images of different objects. Third, 
choosing an optimal measure for comparing the resulting representations. Most 
previous methods have focused on the first problem, by choosing representations 
of images that are invariant, or quasi-invariant to lighting. Edges are a classic 
example. [3] discuss the quasi-invariance to lighting changes of operators that 
use derivatives. Gabor jets are also widely used for image comparison, in part
because they are also considered to be insensitive to lighting changes (e.g., [13]).
[6,2,18] point out that the direction of the gradient is relatively insensitive to
lighting changes. However, it is well-known that quasi-invariance to lighting changes
is difficult to achieve for smooth objects.^ Hence we will not focus on invari- 
ant representations, but tackle the other two problems: increasing dissimilarity 
between images of different objects while constructing an optimal comparison 
measure. 

The primary problem presented by smooth objects is that nearby albedos and 
surface normals are highly correlated, which causes correlations in nearby inten- 
sities in their images. Consequently, comparisons that treat neighboring pixels as 
independent, such as sum-of-squared-differences (SSD) are not statistically valid. 
Moreover, correlations between image pixels improve the chances that images of 
two different objects will match well, since if they are similar at one point, they 
are likely to be similar at many. We approach this problem by constructing a 
statistical model of the dependencies between neighboring portions of smooth 
shapes. We then use this to model the effect that lighting changes have on the 
appearance of a smooth object. We can then design operators to decorrelate the 
pixels in images of these objects. 

We use whitening to lessen dependencies in the difference between two im- 
ages. Signal detection theory tells us that this is the optimal approach when the 
difference between images of the same object consists of colored (non-indepen- 
dent) Gaussian noise [19]. We show that for a simple model of smooth surfaces, 
this is a good characterization. 

Whitening has often been used for decorrelation of images in image processing 
tasks such as watermarking [8,7], image restoration [20,4,5], and texture feature 
extraction [9,14]. Many methods have used some differential operators or the 
Laplacian [17] to approximate the whitening filter, though [14] used a 2D causal 
linear prediction model to derive whitening filters. 

Whitening decorrelates image intensities, but it does not make them insensitive
to lighting variation. In fact, we prove that no linear filter can produce
an image representation that is more insensitive to lighting variation than the
original image. One consequence of this is to prove that non-linear lighting
insensitive methods for rough surfaces, such as the direction of gradient, are more
lighting insensitive than any possible linear filter.

^ This is made explicit in the analysis of [6], which shows that gradient direction is
truly invariant to lighting direction for surfaces with discontinuities, and varies more
rapidly with smoother objects.

To summarize, whitening, like any linear filtering, does not make images of 
the same object more similar. However, it helps to increase dissimilarity be- 
tween images of different objects and allows us to use SSD as the optimal mea- 
sure for comparison. These make whitening a superior comparison method for 
smooth surfaces, which we confirm in our experiments on synthetic and real data. 
We combine whitening with the direction of gradient to produce a comparison 
method that performs very well on both smooth and rough objects. 

2 The Whitening Approach 

As mentioned above, discrimination between smooth objects is difficult due to 
the high correlation between nearby pixels in their images. One consequence of 
this is that pixel by pixel comparisons such as SSD are not optimal. In this 
section we show how to derive linear filters that remove correlations between 
neighboring pixels. These whitened images can then be optimally compared using 
SSD. We take a statistical approach, regarding the difference image $I_d = I_1 - I_2$
as a random variable ($I_1$ and $I_2$ denote two images of the same surface).
We analyze this considering a Lambertian surface illuminated by distant point
sources. Neglecting shadows, we can model the images as $I_1 = \rho N s_1$ and
$I_2 = \rho N s_2$, where $N$ are surface normals, $\rho$ is albedo, and $s_1$, $s_2$
are the light sources in the two images. Then

$I_d = \rho N s_1 - \rho N s_2 = \rho N (s_1 - s_2)$ (1)

Dependencies that exist between nearby surface normals of an object lead to
dependencies in $I_d$, which we treat by modeling $I_d$ as colored Gaussian noise.
(Colored Gaussian noise captures noise with dependencies, whereas white noise is 
independent.) While this model is not strictly true, it is a valuable approximation 
that opens the way to using a whitening filter, which is a standard tool in signal 
detection, to reduce dependency in the difference image. 

2.1 Whitening in Signal Processing 

First we describe whitening. Let $n$ represent as a vector the pixels in the
difference image. Assume that $n$ is Gaussian colored noise. This implies that it is
fully characterized by its first and second order statistics. In particular, the
whitening filter may be designed using the covariance matrix. Let $C = E[nn^T]$ be
the covariance matrix characterizing the distribution of $n$ ($E$ denotes expected
value). Let $W$ be a matrix composed of the scaled eigenvectors of $C$ as
rows. Then, the components of $y = Wn$ are independent, as implied by their
Gaussianity and their covariance:

$E[yy^T] = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots\}$

That is, the multiplication by the matrix $W$ “whitens” the vector $n$.
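A two-pixel example makes the construction concrete. The sketch below assumes a covariance matrix $C$ with unit variances and correlation $r$, whose eigenvectors are known in closed form, and it assumes that each eigenvector is scaled by the inverse square root of its eigenvalue; this particular scaling and the helper names are assumptions made for the illustration. With this scaling the covariance of $y = Wn$ becomes the identity.

```python
import math


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def whitening_matrix(r):
    """Whitening matrix for C = [[1, r], [r, 1]] with |r| < 1.

    The eigenvectors of C are (1, 1)/sqrt(2) and (1, -1)/sqrt(2) with
    eigenvalues 1 + r and 1 - r; each row is one eigenvector divided by the
    square root of its eigenvalue.
    """
    s = 1.0 / math.sqrt(2.0)
    eigenpairs = [(1.0 + r, [s, s]), (1.0 - r, [s, -s])]
    return [[v / math.sqrt(lam) for v in vec] for lam, vec in eigenpairs]


r = 0.9                                       # strongly correlated pixels
C = [[1.0, r], [r, 1.0]]
W = whitening_matrix(r)
cov_y = matmul(matmul(W, C), transpose(W))    # covariance of y = W n
```

Up to rounding, `cov_y` is diagonal, so the two components of $y$ are uncorrelated and, for Gaussian $n$, independent.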
Fig. 1. The roughly planar (random) surface is specified (in 2D approximation) by the
angle $\theta(x)$ that the normal makes with the $z$ direction.



2.2 A Model for Natural Images Rough Plane Covariance 



To whiten a surface’s images, we must understand their covariance structure.
Consider a surface characterized by normal vectors that make small random
perturbations about a common direction (without loss of generality the $z$ axis).
We refer to such a surface as roughly planar and assume that locally a smooth
surface behaves like a roughly planar surface. This is a generalization of the
common facet model [10]. Considering the simplified, 1D, variant, the “surface”
is described by a function $z = f(x)$. The normals at every point $x$ are random
(but not independent!) and each of them is specified by a single parameter $\theta$,
which is its angle relative to the $z$ axis (Figure 1). Quantitatively we characterize
the function $\theta(x)$ as a wide sense (w.s.) stationary Gaussian random process [16].
That is, we assume that the expected value at every point is constant, $\mu_\theta = 0$,
that the variance $C_\theta(x, x) = \sigma_\theta^2$ is constant as well, and that the
auto-correlation $C_\theta(x_1, x_2) = r(x_1, x_2)\,\sigma_\theta^2 = r(|x_1 - x_2|)\,\sigma_\theta^2$
depends only on the distance between two points. $r(|x_1 - x_2|)$ is a correlation
coefficient. We also assume that the surface is Lambertian, and that its albedo $\rho$
is constant, at least locally.

Proposition 1: Under the above assumptions and for a distant light source, illuminating
the surface at angle $\phi$ (relative to the $z$ axis), the reflected light function
$I(x)$ is a random w.s. stationary process. Its expected value, variance and
auto-correlation are:



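$E[I(x)] = \rho \cos\phi \; e^{-\sigma_\theta^2/2}$

$\sigma_I^2 = \frac{1}{2}\rho^2 \left( \sin^2\phi \,(1 - e^{-2\sigma_\theta^2}) + \cos^2\phi \,(1 - e^{-\sigma_\theta^2})^2 \right)$ (2)

$C_I(x_1, x_2) = \frac{1}{2}\rho^2 \left( \sin^2\phi \,(e^{-\sigma_\theta^2(1-r)} - e^{-\sigma_\theta^2(1+r)}) + \cos^2\phi \,(e^{-\sigma_\theta^2(1-r)} + e^{-\sigma_\theta^2(1+r)} - 2e^{-\sigma_\theta^2}) \right)$

where $x_1$, $x_2$ are the two points for which the correlation coefficient of the
tangent direction is $r = r(|x_1 - x_2|)$.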



Proof. (For details see [15].) The reflected light function $I(x)$ is a random process.
Let $x_1$, $x_2$ be two points for which the correlation coefficient of the tangent
direction is $r = r(|x_1 - x_2|)$. Then, their autocorrelation is

$C_I(x_1, x_2) = E[(I(x_1) - E[I(x)])(I(x_2) - E[I(x)])]$

$= \rho^2 E[(\sin\phi \sin\theta_1 + \cos\phi \cos\theta_1 - \cos\phi \, E[\cos\theta_1]) \cdot (\sin\phi \sin\theta_2 + \cos\phi \cos\theta_2 - \cos\phi \, E[\cos\theta_2])]$

$= \rho^2 \left( \sin^2\phi \, E[\sin\theta_1 \sin\theta_2] + \cos^2\phi \, E[\cos\theta_1 \cos\theta_2] - \cos^2\phi \, E[\cos\theta]^2 \right)$

$= \frac{1}{2}\rho^2 \left( \sin^2\phi \,(e^{-\sigma_\theta^2(1-r)} - e^{-\sigma_\theta^2(1+r)}) + \cos^2\phi \,(e^{-\sigma_\theta^2(1-r)} + e^{-\sigma_\theta^2(1+r)} - 2e^{-\sigma_\theta^2}) \right)$

Note that all $\sin\theta_i \cos\theta_j$ terms vanish due to symmetry. The rest of the
derivation requires us to change variables to the sum and difference of $\theta_1$ and
$\theta_2$, which are independent. Simple trigonometric expressions and the Gaussian
integral $\int \cos x \; e^{-x^2/(2a^2)} \, dx = \sqrt{2\pi}\,|a|\,e^{-a^2/2}$ are used as well.
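The closed-form expressions of Proposition 1 can be checked numerically. The following Monte Carlo sketch is an illustration, not part of the original analysis: it draws pairs of correlated Gaussian angles, renders $I = \rho \cos(\theta - \phi)$, which equals $\rho(\sin\phi \sin\theta + \cos\phi \cos\theta)$, and compares the sample mean and covariance with the formulas. The function names, the sample size and the parameter values are assumptions chosen for the example.

```python
import math
import random


def predicted_stats(rho, phi, sigma, r):
    """Mean and covariance C_I from Proposition 1 (r = 1 gives the variance)."""
    s2 = sigma * sigma
    mean = rho * math.cos(phi) * math.exp(-s2 / 2.0)
    sin2, cos2 = math.sin(phi) ** 2, math.cos(phi) ** 2
    e_minus = math.exp(-s2 * (1.0 - r))    # E[cos(theta_1 - theta_2)]
    e_plus = math.exp(-s2 * (1.0 + r))     # E[cos(theta_1 + theta_2)]
    cov = 0.5 * rho ** 2 * (sin2 * (e_minus - e_plus)
                            + cos2 * (e_minus + e_plus
                                      - 2.0 * math.exp(-s2)))
    return mean, cov


def simulated_stats(rho, phi, sigma, r, n_samples=100000, seed=0):
    """Sample mean and covariance of I at two points with correlation r."""
    rng = random.Random(seed)
    sum1 = sum2 = sum12 = 0.0
    for _ in range(n_samples):
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        theta1 = sigma * g1
        # Second angle: same variance, correlation r with the first one.
        theta2 = sigma * (r * g1 + math.sqrt(1.0 - r * r) * g2)
        i1 = rho * math.cos(theta1 - phi)
        i2 = rho * math.cos(theta2 - phi)
        sum1 += i1
        sum2 += i2
        sum12 += i1 * i2
    mean1, mean2 = sum1 / n_samples, sum2 / n_samples
    return 0.5 * (mean1 + mean2), sum12 / n_samples - mean1 * mean2


# Example: albedo 1, illumination at 1 rad, sigma_theta = 0.3, r = 0.8.
mean_pred, cov_pred = predicted_stats(1.0, 1.0, 0.3, 0.8)
mean_sim, cov_sim = simulated_stats(1.0, 1.0, 0.3, 0.8)
```

For these parameters the simulated and predicted values should agree up to the sampling error, and the $\sin^2\phi$ term dominates the covariance, as stated in the discussion following the proof.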

For rougher surfaces (larger $\sigma_\theta^2$) correlation decreases, while for the
(impossible) white surface (independent normals, $r = 0$), the image is white as well.

The covariance in eq. 2 is non-stationary and it varies with $\phi$. It can be shown,
however, that of the two additive terms in the covariance expression the first is
dominant, provided the surface is smooth ($\sigma_\theta$ is small) and that the
illumination angle $\phi$ is not very small. This readily implies that:



Covariance characterization for rough Lambertian plane: the second or- 
der statistical behavior of a rough Lambertian, planar surface, illuminated by a 
single source, is characterized by an autocorrelation function which, for nearly 
every illumination, is approximately invariant of the illumination direction up 
to a multiplicative factor. 

See [15] for experimental validation of this result for real objects. 



2.3 Whitening Using AR Models 

Designing a whitening filter by estimating the covariance is problematic as the 
covariance (and the mean) are nonstationary. Fortunately, fitting a parametric 
Autoregressive (AR) model, allows us to get the whitening filter directly without 
explicitly estimating covariance [12]. 

A sequence $x(n)$ is called an AR process of order $p$ if it can be generated as
the output of the recursive causal linear system

$x(n) = \sum_{k=1}^{p} a(k)\,x(n-k) + e(n), \quad \forall n$ (3)

where $e(n)$ is white noise, and the sum $\hat{x}(n) = \sum_{k=1}^{p} a(k)\,x(n-k)$ is
the best linear mean squared (MS) predictor of $x(n)$ based on the previous $p$ samples.
Given a random sequence (with possible dependencies), an AR model can be
fitted using SVD to estimate the overdetermined parameters $a(k)$ which
minimize the empirical MS prediction error $\sum_n (x(n) - \hat{x}(n))^2$. For Gaussian
signals the prediction error sequence $e(n) = x(n) - \hat{x}(n)$ is white, implying that
the filter $W = (1, -a_1, \ldots, -a_p)$ is a whitening filter for $x(n)$. We have
adopted a 2D “causal” model described in [12], where a gray level $x(n)$ is predicted
from the previous gray levels in a $p \times p$ neighborhood in a column by column scan.
Using a non-causal neighborhood leads to a lower SSD, but the prediction error sequence
is not white [12].

Note that scaling all the grey levels by the same factor would give a correla- 
tion function that is the same up to a multiplicative constant. This is essentially 
what happens when the angle between the average normal and the illumination 
direction changes. Fortunately, this does not change either the AR coefficients, 
or the resulting whitening filter, implying that it can be space invariant. 
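The fitting step and the scale invariance noted above can be sketched for a 1D sequence. This is a simplified illustration, not the 2D causal model of [12]: the coefficients are estimated by solving the least-squares normal equations with a small elimination routine instead of an SVD, and the function names are hypothetical.

```python
def solve(A, b):
    """Solve the dense system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            m = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= m * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_ar(x, p):
    """Least-squares estimate of a(1), ..., a(p) minimising the empirical
    mean squared prediction error (normal equations instead of SVD)."""
    rows = [[x[n - k] for k in range(1, p + 1)] for n in range(p, len(x))]
    targets = [x[n] for n in range(p, len(x))]
    G = [[sum(row[i] * row[j] for row in rows) for j in range(p)]
         for i in range(p)]
    g = [sum(row[i] * t for row, t in zip(rows, targets)) for i in range(p)]
    return solve(G, g)


def whiten(x, a):
    """Apply W = (1, -a_1, ..., -a_p): returns the prediction error e(n)."""
    p = len(a)
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(p))
            for n in range(p, len(x))]


def lag1_corr(x):
    """Sample correlation coefficient between neighbouring samples."""
    m = sum(x) / len(x)
    d = [v - m for v in x]
    return sum(d[i] * d[i + 1] for i in range(len(d) - 1)) / sum(v * v for v in d)
```

For a sequence generated by an AR process, `lag1_corr(whiten(x, fit_ar(x, p)))` should be close to zero even when `lag1_corr(x)` is large. Multiplying `x` by a constant scales both sides of the normal equations by the same factor, so `fit_ar` returns the same coefficients, which is the scale invariance described in the text.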

The whitening filter depends on the image statistics. Intuitively, for smoother 
images the correlation is larger and decorrelating it requires a wider filter. For 
images which are not so smooth the decorrelation is done over a small range, and 
the filter looks very much like the Laplacian, which is also known to have some 
whitening effect. Therefore, for rougher images, we do not expect to perform 
better than an alternative procedure using the Laplacian. As we shall see later, 
for smooth objects the performance difference is significant. 

2.4 Whitening Images from Different Objects 

Signal detection theory tells us that whitening is useful for image comparison 
because whitened images from the same object can be optimally compared using 
SSD. Whitening has another advantage: it makes images from different objects
more distinctive. 

To see this, let $S$ denote a 3D surface. We will take two pictures of $S$ in a
fixed pose with two different point sources of light, $s_1$ and $s_2$. $s_1$, $s_2$ are
each $3 \times 1$ vectors that encode lighting direction and magnitude. $p_{i,j}$ denotes
a patch of the surface corresponding to an image pixel. We approximate $p_{i,j}$ as a
planar patch, with surface normal $\hat{N}_{i,j}$ and albedo $\rho_{i,j}$. It will be
convenient to denote the scaled surface normal $\rho_{i,j}\hat{N}_{i,j}$ by $N_{i,j}$.
We denote the image pixels corresponding to $p_{i,j}$ by $I_{1,i,j}$, $I_{2,i,j}$ in the
two images. So we may write, for example, $I_{1,i,j} = N_{i,j}^T s_1$, since we ignore
the effects of shadows.

Let $L$ denote a whitening filter, represented discretely as a matrix with elements
$L_{k,l}$. Without loss of generality we suppose $L$ is square and $-n \le k, l \le n$.
If we apply this filter to the image $I_1$ we denote the output as $\mathcal{I}_1$. So:

$\mathcal{I}_{1,i,j} = \sum_{k=-n}^{n} \sum_{l=-n}^{n} L_{k,l}\, I_{1,i+k,j+l}$

We can define a new surface, $\mathcal{S}$, such that its scaled surface normals are:

$\mathcal{N}_{i,j} = \sum_{k=-n}^{n} \sum_{l=-n}^{n} L_{k,l}\, N_{i+k,j+l}$

Intuitively, $\mathcal{S}$ can be thought of as the surface filtered by $L$. According
to our model, while the original normals are highly correlated, the whitened normals
will be white noise, with randomized directions and scales. As high-dimensional,
white noise, different whitened surfaces will also be uncorrelated with each other, 
with high probability. This is analogous to taking a smooth, white surface and 
splattering it with gray paint. Smooth surfaces are easily confused with each 
other, while highly textured ones are not. Of course, whitening does not add 
differences to signals, it makes explicit the differences that are already there. 
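The equivalence between filtering an image and filtering the surface follows from linearity and can be verified numerically. The sketch below is an illustration with hypothetical helper names: it renders a small grid of scaled normals without shadows, filters the image over the region where the filter fits entirely, and compares the result with the image rendered from the filtered normals.

```python
def render(normals, s):
    """Shadow-free Lambertian image: I[i][j] = N[i][j]^T s for scaled normals."""
    return [[sum(c * sc for c, sc in zip(N, s)) for N in row]
            for row in normals]


def filter_image(img, L):
    """out[i][j] = sum_{k,l} L[k][l] * img[i+k][j+l], valid region only."""
    n = len(L) // 2
    return [[sum(L[k + n][l + n] * img[i + k][j + l]
                 for k in range(-n, n + 1) for l in range(-n, n + 1))
             for j in range(n, len(img[0]) - n)]
            for i in range(n, len(img) - n)]


def filter_normals(normals, L):
    """The same filter applied to each component of the scaled normals."""
    n = len(L) // 2
    return [[[sum(L[k + n][l + n] * normals[i + k][j + l][c]
                  for k in range(-n, n + 1) for l in range(-n, n + 1))
              for c in range(3)]
             for j in range(n, len(normals[0]) - n)]
            for i in range(n, len(normals) - n)]
```

For any grid `normals`, light source `s` and square filter `L` of odd size, `filter_image(render(normals, s), L)` and `render(filter_normals(normals, L), s)` agree up to rounding. The filtered image is therefore an image of the filtered surface under the same light.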

More formally, communication theory tells us that discriminating between
correlated models is difficult. Specifically, for two unit energy signals $z_i(x)$,
$z_j(x)$, the correlation coefficient is $\rho_{ij} = \int z_i(x)\,z_j(x)\,dx$. For best
performance, the correlation coefficient between any pair of models should be as low as
possible. For two signals the lowest correlation is $-1$, and choosing
$z_2(x) = -z_1(x)$ is optimal. When the number of signals is large, such correlations
between all signals are not possible, and the best we can get is $\rho \approx 0$ [19].

Whitening treats the signals and the noise equally and therefore leaves the 
signal to noise ratio (SNR) the same. However the whitened signals become 
uncorrelated and therefore with the same SNR we get better performance. The 
correlation between the original images associated with different objects is high 
initially and is almost zero afterwards, so the improvement is significant. 



3 Invariance and Linear Filtering 

While most prior work has focused on finding lighting insensitive image compar- 
isons, we have not argued that whitening is lighting insensitive. We now prove 
a result that casts doubt on the ability of any linear filter to produce lighting 
insensitive representations. 

Theorem 1. Suppose that the lighting directions $s_1$ and $s_2$ are drawn from a
uniform distribution, and that we neglect the effects of shadows in images. Then
$I_{1,i,j} / I_{2,i,j}$ and $\mathcal{I}_{1,i,j} / \mathcal{I}_{2,i,j}$ are identically
distributed. That is, the distribution of the ratio of intensities between one image of
an object and another is unaffected by filtering with an arbitrary linear filter. In
this sense, no linear filter can produce a lighting insensitive representation.

Proof. This follows immediately once we consider that linearly filtering the images
is equivalent to filtering the surface, as described above. Let $\mathcal{N}$ denote the
filtered normals and $\mathcal{I}_1$, $\mathcal{I}_2$ the filtered images, as above, but
now for an arbitrary linear filter. Let $\hat{\mathcal{N}}_{i,j}$ denote a unit vector in
the direction of $\mathcal{N}_{i,j}$. Then

$\frac{I_{1,i,j}}{I_{2,i,j}} = \frac{\hat{N}_{i,j}^T s_1}{\hat{N}_{i,j}^T s_2}, \qquad \frac{\mathcal{I}_{1,i,j}}{\mathcal{I}_{2,i,j}} = \frac{\hat{\mathcal{N}}_{i,j}^T s_1}{\hat{\mathcal{N}}_{i,j}^T s_2}$

Since $s_1$ and $s_2$ are uniformly distributed, it is clear from symmetry that these
two fractions are identically distributed, because $\hat{N}_{i,j}$ and
$\hat{\mathcal{N}}_{i,j}$ are identical up to a rotation. In sum, we have created a
filtered surface that is affected by lighting changes exactly as the original surface.
It is possible to extend this result to handle the case of attached shadows by
restricting the distribution of light sources to appear in a hemisphere above the 
surface normal. However we omit details of this for lack of space. 

4 Experiments 

We tested our ideas by applying them to object recognition. A set of objects is
represented in a library containing one image for every object. Let
$I_{m_1}, I_{m_2}, \ldots$ be the reference images in the library. Let $I_q$ be the
query image of one of the objects from this set, taken with the same pose, but different
illumination. The task is to decide which of the objects is the one in the query image.
Since the reference image $I_{m_j}$ was taken with a different illumination intensity
than the test image, every scaled version of it is a valid model as well. Minimizing
the SSD over all scaled versions is equivalent to taking the SSD between the
normalized whitened images, which is monotonic in the projection as well. This
normalization also compensates for the fact that some objects are rougher than
others, which makes the difference between two differently illuminated images
of them larger. Therefore we perform the following steps: 1) For every reference
image $I_{m_j}$, use the whitening operator $W$ to calculate the normalized $L_2$ norm

$E_j = \left\| \frac{W(I_{m_j})}{\|W(I_{m_j})\|} - \frac{W(I_q)}{\|W(I_q)\|} \right\|$

2) Choose the model associated with the smallest whitened error norm, $E_j$.
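The two steps can be sketched as follows. This is an illustration on 1D signals rather than images, with a short fixed filter standing in for a trained whitening operator; the function names and the filter taps are assumptions made for the example.

```python
import math


def apply_filter(signal, taps):
    """Valid-region FIR filtering: out[n] = sum_k taps[k] * signal[n - k]."""
    p = len(taps) - 1
    return [sum(t * signal[n - k] for k, t in enumerate(taps))
            for n in range(p, len(signal))]


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def whitened_error(reference, query, taps):
    """Step 1: L2 distance between the normalized filtered signals."""
    wr = normalize(apply_filter(reference, taps))
    wq = normalize(apply_filter(query, taps))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(wr, wq)))


def identify(library, query, taps):
    """Step 2: index of the reference with the smallest error, plus all errors."""
    errors = [whitened_error(ref, query, taps) for ref in library]
    best = min(range(len(library)), key=errors.__getitem__)
    return best, errors


# Usage: the query is the first reference under a brighter light source.
library = [[0.0, 1.0, 4.0, 9.0, 16.0, 25.0], [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]]
query = [2.5 * v for v in library[0]]
best, errors = identify(library, query, [1.0, -0.5])
```

Because both filtered signals are normalized, a pure change of illumination intensity leaves the error for the correct reference at zero up to rounding, so the first library entry is selected.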

We tested the whitening approach on smooth textureless surfaces. We also 
integrated whitening with a comparison method designed for rough surfaces, and 
showed that this combined method could work on rough and smooth surfaces. 

4.1 Synthetic Images 

The first set of experiments was done using synthetic images. Every scene was 
created as a sum of random harmonic functions, with fixed amplitudes but ran- 
dom directions and phases. This provides an ensemble of images with similar 
statistical properties. These were rendered as Lambertian surfaces with point 
sources. 

We trained a whitening filter using 1000-5000 images with a fixed illumination,
deviating 67.5 degrees from the z direction. The training set was independent
of the test set. A test was done as follows: two random scenes were
illuminated by the same nearly vertical illumination to create two reference
images $I_1$, $I_2$. The test image $I_t$ was synthesized from the first scene, with a
different illumination, making an angle $\phi$ with the z axis (see Figure 2).

For comparison we also tested other algorithms using the SSD of the gray
level image, a Laplacian filtered image, and the direction of the gradient¹. See
Figure 2 for the results.

We came to several conclusions. First, whitening was the most successful
method. Second, whitening worked best with a large filter, but it also worked
substantially better than other methods even with a 7 x 7 filter, except for
extreme illumination angles. In particular whitening was always better than the
Laplacian, even when a 3 x 3 filter was used, implying that both large distance
correlations and causality are important.

¹ We did not test Gabor Jets on the synthetic images, but later experiments show
that they are not especially effective on smooth surfaces.

Fig. 2. The top and the center images in the left column correspond to different surfaces
and one illumination. The bottom image is created from the same scene used for the
top image, but with a different illumination. The center column shows the whitened
images and illustrates that whitening reveals hidden differences. The plot on the right
shows recognition performance of the tested methods on the synthetic images. The
success rate is plotted against the average angle between the illumination source and
the average surface normal.

Fig. 3. Samples from the smooth real objects data set; top - frontal illumination,
bottom - side illumination.

4.2 Real Smooth Objects 

Next, we describe experiments with real, smooth objects that produce images
with substantial shadows (Figure 3). We created eighteen objects from clay and
illuminated them by a single light source moving along a half circle, so that its
distance from the object was roughly fixed. We used a camera placed vertically
above the object, and took 14 images of every object with different lighting
directions at angles in the range [-70, 70] degrees to the vertical axis. One image
of each object, associated with a nearly vertical illumination, was chosen as the
reference image.

Fig. 4. Recognition performance of the tested methods on real smooth objects on the
left and rough objects (Yale database) on the right. The success rate is plotted against
the average angle (in degrees) between the illumination source and the average surface
normal.

The whitening filter was trained on the difference images between reference 
images and corresponding images associated with the same object and six other 
illuminations. Only twelve images associated with 2 objects (out of 18) were 
used. We learned the whitening filter as a 2D causal filter with 25 coefficients 
inside 7 x 7 windows. All images of the 18 objects except the reference images
were used as query images (234 images). We divided the query images into four
groups according to their angular lighting direction: 10°-25°, 26°-40°, 41°-55°,
and 56°-70°.

The plot in Figure 4 shows our results. Whitening again performed better 
than the other methods. We also observed that for a few of the roughest objects,
the Laplacian, whitening and gradient angle performed equally well. For
smoother objects, however, whitening worked considerably better. The Laplacian
could not whiten the smooth surfaces, because its size was insufficient to
handle the high correlations between the grey levels of the smooth surfaces.

4.3 The Combined Method 

To handle objects that may be rough or smooth, we propose that whitening be 
combined with a measure that is geared towards handling rough objects, such 
as the direction of gradient. We have done a proof-of-concept implementation 
of a simple combined method. Direction of gradient is naturally normalized to
the $[0, \pi]$ range. Whitening, however, requires normalization prior to combining.
Let $s_1, s_2, \ldots, s_n$ denote the distances between the query image and n reference
images after whitening. We normalize them to the [0, 1] range by dividing all the
distances by $\max_j s_j$. Different areas in the image can have different roughness
levels. We compensate for this effect by choosing the normalization factor adaptively
in 10 x 10 pixel areas instead of the whole image. Our experiments showed
that adaptive normalization yields better results. We have also scaled the direc- 
tion of gradient output to the [0, 1] range. We have tested the combined method 
on both smooth (Figure 4 left) and rough data (Figure 4 right) sets. As a smooth 
set we took the clay objects described in the previous section. As a rough set 
we took the Yale database [6], which contains 20 objects with abrupt changes 
in albedo and shape. The database consists of 63 images of each object with 
lighting direction deviating up to 90 degrees from the frontal. Our experiments 
showed that the combination of whitening and direction of gradient (CWD) was 
better than either whitening or direction of gradient alone on both data sets; 
and CWD had the best (and perfect) performance on the smooth set. On the 
rough data the combined method performed very well, but not as well as Gabor 
Jets. In future work we plan to continue this approach and try to find a more 
clever combining technique that will integrate whitening with some variation of 
Gabor Jets. We also tested a combination of Laplacian and direction of gradient. 
This combination performed less well than CWD on smooth data and similar to 
CWD on the rough data. The Laplacian has some whitening effect, which ex- 
plains its good performance on smooth data. On the other hand, decorrelation 
in the rough objects occurs over a small range, and the whitening filter looks 
very much like the Laplacian explaining the results on the rough set. 
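
The normalization described in this section can be sketched as below. The text does not state how the two normalized measures are merged, so the plain sum used here is an assumption, as are the function names. The sketch also normalizes over the whole image, not adaptively in 10 x 10 areas.

```python
import math


def normalize_by_max(distances):
    """Map whitening distances to [0, 1] by dividing by the largest one."""
    largest = max(distances)
    if largest == 0.0:
        return [0.0 for _ in distances]
    return [d / largest for d in distances]


def combine(whitening_distances, gradient_distances):
    """Combine whitening and direction-of-gradient distances per reference.

    gradient_distances are assumed to lie in [0, pi] and are rescaled to
    [0, 1]. Summing the two terms is an illustrative choice, not the rule
    used in the paper.
    """
    w = normalize_by_max(whitening_distances)
    g = [d / math.pi for d in gradient_distances]
    scores = [a + b for a, b in zip(w, g)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Three reference images: distances of the query to each of them.
best_ref, combined = combine(
    [2.0, 4.0, 1.0],
    [math.pi / 2, math.pi / 4, math.pi],
)
```

In this toy case neither measure alone prefers the first reference (whitening favours the third, direction of gradient the second), but the combined score does.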

5 Conclusions 

In this work we have proposed a measure for image comparison of smooth sur- 
faces under varying illumination. The measure was motivated by a simple sta- 
tistical model of smooth surfaces. This model showed that the error between 
two images associated with the same object under different lighting may be 
modelled as colored noise. We adapted well-known techniques of whitening to 
perform matching of images corrupted by such noise. 

We found that whitening was more effective than other representations for 
comparing images of smooth surfaces taken under varying illumination condi- 
tions. Previous methods have commonly used the Laplacian or the magnitude 
of gradient, as whitening approximations. This seems to be adequate for rough 
images but leads to inferior results for smoother ones. 

We believe that recognition (or image comparison in general) should use all 
the image information. Many current methods neglect photometric information 
and thus cannot handle smooth objects. Our preliminary results showed that a 
proper combining method, using both the information in edges and in smooth 
patches, would yield superior results, especially in hard tasks.

Abstract. We investigate the camera geometry of lines parallel in the
world. In particular, we formalize the known rotational constraints and 
add new linear constraints on camera position. The constraints on camera 
position do not require the cameras to be viewing the same lines, thus 
providing applications for occluded scenes and calibration of cameras for 
which fields of view do not intersect. The constraints can also be viewed 
as constraints of camera geometry with planar patch coordinate systems, 
and provide a way to investigate texture in a deeper way than has been 
done to date. 



1 Introduction 

The geometry of parallel lines has been used extensively in computer vision, but 
to our knowledge only by way of the plane at infinity using the computation 
of vanishing points. The two main applications are calibration and shape from 
texture, and both are based on the principle that the vanishing point of a set 
of parallel lines is not affected by translation, as it lies on the plane at infinity. 
While these are important applications, there are geometric relations on sets 
of parallel lines embedded in a planar patch, which take into account distances 
between the lines rather than just their vanishing point. 

Vanishing points have been used by many in computer vision, mostly for the 
determination of rotation and calibration for which they are particularly well 
suited, since they are unaffected by translation. There are numerous examples 
[1,3]. These methods have not looked further into the lines of which the vanishing 
points are composed, but it is helpful to look at the individual lines. 

Lines have been used extensively in computer vision [5,7,4], but in general
have not been as prominent as points, probably because they are difficult to
work with. However, the use of Plücker coordinates [6,2,8] can make many
reconstruction and constraint derivations easier. In this paper we introduce an
extension of the Plücker coordinates for lines in order to investigate lines embedded
in a plane.

2 Notation and Reconstruction



Our use of parallelism restricts our world points to be represented by 3-vectors
P. We use homogeneous coordinates for image points p, so they are also 3-vectors.
Image lines are also represented by homogeneous 3-vectors $\ell$. If a point
p is on a line $\ell$ then their coordinates are perpendicular: $p^T \ell = 0$. We use the
general linear transformation B = KR to encapsulate a rotation followed by a calibration. For
a matrix B, we use $B^{-T}$ to denote the inverse transpose of that matrix. Points
and lines that we actually measure in an image we denote by $\hat{p}$ and $\hat{\ell}$. Note
that if $\hat{p} = B(\ell_1 \times \ell_2)$, then $\hat{p} = (B^{-T}\ell_1) \times (B^{-T}\ell_2)$ (up to scale). For clarity, in equations
we often use p and $\ell$, which denote the calibrated and derotated coordinates for
those points and lines.



2.1 Lines 

We use the Plücker coordinate system for world lines, which is particularly well
suited for rigid motions of lines.

Definition 1. A world line L is the set of all the points $P \in \mathbb{R}^3$ such that
$P = (1 - \lambda) Q_1 + \lambda Q_2$ for two points $Q_1$, $Q_2$ and some scalar $\lambda$. The Plücker
coordinates of this line are $L = \begin{bmatrix} L_d \\ L_m \end{bmatrix}$, where:

$L_d = Q_2 - Q_1$ \quad direction of L \quad (1)

$L_m = L_d \times P$ \quad moment of L \quad (2)

If we have a line L and a camera (B, T), then the image line associated with
L is

$\hat{\ell} = B^{-T}(L_m - T \times L_d)$ \quad (3)

where $\hat{\ell}$ is perpendicular to the plane containing the line in the image. If B is the
rotation/calibration matrix for a point P, then $B^{-T}$ is the rotation/calibration
matrix for a line L.
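
As a small numerical check of Definition 1, the sketch below computes the direction and moment of a line from two points. It confirms that the moment does not depend on which point P of the line is used, and that the moment is perpendicular to the direction. The helper names and the sample points are illustrative assumptions.

```python
def sub(a, b):
    """Component-wise difference of two 3-vectors."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    """Dot product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def plucker(q1, q2):
    """Direction L_d = Q2 - Q1 and moment L_m = L_d x Q1 of the line."""
    l_d = sub(q2, q1)
    l_m = cross(l_d, q1)
    return l_d, l_m


def point_on_line(q1, q2, lam):
    """P = (1 - lam) * Q1 + lam * Q2."""
    return tuple((1 - lam) * a + lam * b for a, b in zip(q1, q2))


q1 = (1, 0, 2)
q2 = (3, 1, 2)
l_d, l_m = plucker(q1, q2)

# The moment computed from any other point of the line is the same,
# because L_d x (Q1 + lam * L_d) = L_d x Q1.
other = point_on_line(q1, q2, 3)
l_m_other = cross(l_d, other)
```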

Lines are easier to reconstruct than points because the reconstruction always 
exists. It is easily proved that: 

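Proposition 1. If we have a line L in space which projects to two image lines
$\hat{\ell}_1$ and $\hat{\ell}_2$ in distinct cameras $(B_1, T_1)$ and $(B_2, T_2)$, then we can calculate the
coordinates for L if $|\ell_1 \times \ell_2| \neq 0$, where $\ell_i = B_i^T \hat{\ell}_i$:

$L = \begin{bmatrix} \ell_1 \times \ell_2 \\ \ell_1 T_2^T \ell_2 - \ell_2 T_1^T \ell_1 \end{bmatrix}$ \quad (4)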



It is possible that the cross product above will be zero. In this case the line exists
in the plane at infinity.

2.2 Singly Textured Planes

We introduce a new object to computer vision to encapsulate a set of parallel
lines embedded in a plane. We motivate our definition as follows. Consider one
line from the set of equally spaced lines in the plane, call it $L_0$. We may represent
this line using Plücker coordinates as $L_0 = \begin{bmatrix} L_d \\ L_m \end{bmatrix}$. Let $Q_0$ be a point on
$L_0$, and let the point $Q_x = Q_0 + x d$ be on the line at distance x in the texture
for some direction d. We can easily show that:

$L_d \times Q_0 = L_m$ \quad (5)

so that we get

$L_{m,x} = L_d \times Q_x$ \quad (6)

$= L_d \times Q_0 + x L_d \times d$ \quad (7)

$= L_m + x L_a$ \quad (8)

where $L_a = L_d \times d$. Note that since both $L_d$ and d are vectors which lie inside
the plane, we must have that $L_a$ is normal to the textured plane. This leads us
to the following definition, as shown in figure 1.

Definition 2. A singly textured plane H is a set of lines, equally spaced,
embedded in a world plane. We give the textured plane coordinates

$H = \begin{bmatrix} L_d \\ L_m \\ L_a \end{bmatrix}$ \quad (9)

with $L_d^T L_m = 0$ and $L_d^T L_a = 0$. The coordinates of each line in the plane, indexed
by n, are:

$L_n = \begin{bmatrix} L_d \\ L_m + n L_a \end{bmatrix}$ \quad (10)



Our constraints are all based on intersection conditions between two textured 
planes. 
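
The following sketch illustrates Definition 2 on a toy example. It builds the coordinates of a singly textured plane from a point on the zeroth line, the line direction, and the step between neighbouring lines, then checks that the moment of line n equals $L_m + n L_a$. The function names and numeric values are assumptions chosen for illustration.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    """Dot product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def textured_plane(q0, line_dir, step):
    """Return (L_d, L_m, L_a) for lines of direction line_dir through
    q0 + n * step, n = 0, 1, 2, ..."""
    l_d = line_dir
    l_m = cross(l_d, q0)    # moment of the zeroth line
    l_a = cross(l_d, step)  # change of moment from one line to the next
    return l_d, l_m, l_a


def line_of_index(plane, n):
    """Plücker coordinates (L_d, L_m + n * L_a) of the n-th line."""
    l_d, l_m, l_a = plane
    return l_d, tuple(m + n * a for m, a in zip(l_m, l_a))


# Lines parallel to the x axis, spaced 2 apart along y, in the plane z = 5.
q0 = (0, 0, 5)
plane = textured_plane(q0, (1, 0, 0), (0, 2, 0))
l_d, l_m, l_a = plane

# Moment of line 3 from the plane coordinates and directly from a point on it.
_, moment_from_plane = line_of_index(plane, 3)
moment_direct = cross(l_d, (0, 6, 5))
```

Here $L_a$ comes out as (0, 0, 2), which is normal to the plane z = 5, in line with the remark preceding the definition.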



Fact 1. If we have two textured planes $H_1$ and $H_2$, then they lie on the same
world plane if and only if:

$L_{d,1}^T L_{m,2} + L_{d,2}^T L_{m,1} = 0$ \quad (11)

and

$L_{d,1}^T L_{a,2} = 0 \qquad L_{d,2}^T L_{a,1} = 0$ \quad (12)

We now turn to the reconstruction of a textured plane from image lines in 
four cameras. This reconstruction is non-intuitive in a sense because we do not 
require that the cameras be looking at the same lines. Each of the four cameras 
can look at a different line. We only require that we know which line has been 
imaged, that is, its index n. Given these four lines we can reconstruct a textured 
plane, as in figure 2, with the following multilinear equation.

Fig. 2. Reconstructing a Textured Plane

Fact 2. If we have a textured plane H which is imaged by four cameras into
image lines $\hat{\ell}_i$, and we know that our cameras have parameters $(B_i, T_i)$, and
further, we know that the image lines have indices $n_i$, then we may reconstruct
the textured plane as:





$H = \begin{bmatrix} L_d \\ L_m \\ L_a \end{bmatrix} = \sum_{[i_1 i_2 i_3 i_4] \in \mathrm{perm}^+(1234)} \ldots$ \quad (13)





Note that $|\cdot|$ is the signed magnitude, and since the coordinates are homogeneous,
it does not matter which sign is chosen. The same result could be obtained by
defining

$|\ell_i \times \ell_j| = |\ell_i \; v \; \ell_j|$ \quad (14)

where v is any arbitrary vector not in the plane of $\ell_i \times \ell_j$.



The proof of this is in the supplement.

3 Rotational Constraints

If we have three image lines $\hat{\ell}_i$, each of which is an image of one of a set of parallel
world lines, then using two of the cameras we may reconstruct the direction
$L_d$ of the world lines. From the construction of the Plücker lines, we know that any
line with direction $L_d$ must have moment vector perpendicular to $L_d$. Putting
these two facts together, we obtain

Proposition 2. If we have one, two, or three parallel world lines, and three
cameras with rotation/calibration matrices $B_i$, then if these three cameras view
images of one of our world lines as $\hat{\ell}_i$, with the lines not necessarily the same in
all cameras, then we obtain the prismatic line constraint.

$\hat{\ell}_2^T B_2 (B_1^T \hat{\ell}_1 \times B_3^T \hat{\ell}_3) = 0$ \quad (15)

We may identify cameras 3 and 1 by setting $B_3 = B_1$, which corresponds to the
case where both $\hat{\ell}_1$ and $\hat{\ell}_3$ are taken from the same camera. If these are different
parallel lines, then we obtain the vanishing point constraint, noted by many, for
example [9].

Proposition 3. We have two or three parallel world lines, and two cameras
with rotation/calibration matrices $B_i$. If camera 1 views image lines $\hat{\ell}_1$ and $\hat{\ell}_3$
and camera 2 views image line $\hat{\ell}_2$ we obtain the vanishing point constraint.

$\hat{\ell}_2^T B_2 B_1^{-1} (\hat{\ell}_1 \times \hat{\ell}_3) = 0$ \quad (16)

$\hat{\ell}_2^T B_2 B_1^{-1} \hat{p} = 0$ \quad (17)

The quantity $\hat{p} = \hat{\ell}_1 \times \hat{\ell}_3$ is called a vanishing point, and it is the point through
which all images of world lines of direction $L_d$ will pass. The constraint says that
if we have a vanishing point in one image and a line in another image which we
know is parallel to the lines in the first camera, then we have a constraint on
the $B_i$.
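
A toy check of the vanishing point construction is sketched below, under the simplifying assumption that both views share the same rotation/calibration, so that $B_2 B_1^{-1}$ is the identity. Image lines are homogeneous 3-vectors, the vanishing point is the cross product of two of them, and a third line of the same parallel family has zero dot product with it. The image points and helper names are illustrative assumptions.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    """Dot product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def line_through(p, q):
    """Homogeneous image line through two homogeneous image points."""
    return cross(p, q)


# Three image lines that all pass through the image point (2, 3),
# as images of parallel world lines do at their vanishing point.
common = (2, 3, 1)
line_1 = line_through(common, (0, 0, 1))
line_2 = line_through(common, (1, 0, 1))
line_3 = line_through(common, (0, 1, 1))

# Vanishing point from lines 1 and 3 (homogeneous, defined up to scale).
vanishing_point = cross(line_1, line_3)

# Constraint (17) with B_2 B_1^{-1} taken to be the identity.
residual = dot(line_2, vanishing_point)
```

With general $B_1$ and $B_2$ the vanishing point would first be mapped by $B_2 B_1^{-1}$ before taking the dot product.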

If we further identify cameras 2 and 1, then given an image of a set of parallel 
lines in one camera, we know that we must still have a zero triple product. 

Proposition 4. We have three parallel world lines, and a camera with rotation/calibration
nonlinear function $B : \mathbb{R}^3 \to \mathbb{R}^3$. Given images of these three
world lines $\hat{\ell}_i$, $i \in [1, \ldots, 3]$, we obtain the vanishing point existence
constraint.



=0 (18) 

This last constraint means nothing if B is a linear function, since the constraint
would be trivially satisfied. However, in the case where there is some nonlinear
distortion in the projection equation, there will be a constraint on B, so we
may say that the prismatic line constraint operates on 1, 2, or 3 cameras. We
next go over the standard multilinear constraints to show how the prismatic line
constraint relates to them.

4 Translational Constraints

It is as simple to form the texture constraints as it was to form the previous
constraint. There is a line texture constraint, a point texture constraint, and a
mixed constraint. Keep in mind that all these constraints can be applied to fewer
cameras by identifying the camera positions associated with various subsets of
lines.

The first is the five camera constraint, which we call the harmonic trifocal,
as shown in figure 3.




Fig. 3. The Harmonic Trifocal Constraint operates on five image lines 



Fact 3. If we have five cameras and measure five lines $\hat{\ell}_i$, which have
indices $n_i$, from a textured plane H, we may form the $\ell_i$ using the $\hat{\ell}_i$ and the $B_i$
and have the following constraint:

$0 = \sum_{[i_1..i_5] \in P^+[1..5]} \ldots$ \quad (19)

where $\mathrm{perm}^+$ denotes the even permutations.

Proof. Using fact 2, we may reconstruct the textured plane to obtain the parameters
of the textured plane H using the lines one through four. Using this
reconstruction, we can find the fifth image line as:

$\ell_5 = L_m + n_5 L_a - T_5 \times L_d$ \quad (20)

If $p_5$ is a point on $\ell_5$, we know that $p_5$ is perpendicular to $\ell_5$, so that $p_5^T \ell_5 = 0$.
We can use this with the above equation to formulate the constraint. Note that
since $L_d$ is perpendicular to $\ell_5$, $L_d$ is a point on the line $\ell_5$, but if we set
$p_5 = L_d$, all of the right hand side terms disappear and we have no constraint.
Therefore we know that there is only one equation in our constraint, and we use
$p_5 = L_d \times \ell_5$. We can derive

$0 = (L_d \times \ell_5)^T (L_m + n_5 L_a - T_5 \times L_d)$ \quad (21)

$= |L_d \; \ell_5 \; L_m| + n_5 |L_d \; \ell_5 \; L_a| - (L_d \times \ell_5)^T (T_5 \times L_d)$ \quad (22)

We use vector algebra and the fact that $L_d^T \ell_5 = 0$ to obtain $-L_d^T L_d T_5^T \ell_5$ for the
last term

^ [2n,,n,,{£,,x£,,r{£,x£,,)Tl£,, (23) 

bi..i4]eP+[i..4] 

+ 2n,-,n5(^,, X 3)^(4 (24) 

+ njirij^{£j^x£j^) {£jj^x£j^)'T^£5 (25) 

which we can expand to 

= ^-^ 12 ) i^5'>^£j3)T^ji£ji (26) 

bi..i4]eP+[i..4] 

+ {£j^ X £j ^ ) {£j^ X £^)Tj^£j^ (27) 

+ n 5 nji (^5 x^ja) {£j^x£j^)'Tj^£j^ (28) 

+ riji ri 5 {£j^ X £ 5 ) (£j 3 x £j^ ) T (29) 
+ njiTij^{£j-^x £j^) {tj^x£j^)T^£^\ (30) 



and this is equal to the desiderata. 

Next is the mixed constraint, which operates on six cameras, as in figure 4. 




Fig. 4. The Hexalinear Constraint operates on six image lines

Fact 4. If we have six cameras which measure six lines $\hat{\ell}_i$ on a doubly
textured plane, and the first four lines measure one texture with indices $n_i$ and
the last two lines measure the other texture with unknown indices, then we may
form the following constraint:

0= ^ n,,\£5£6£^M2X^^,\'^^A. (31) 

[21..Z4]gP'*"[1..4] 

Proof. We may reconstruct the $L_{a,1}$ of the singly textured plane from the first
four cameras. We may reconstruct the $L_{d,2}$ of the world line using the last two
cameras. Using facts 1 and 2 we may easily obtain the equation.

Last is the harmonic epipolar constraint, which operates on eight cameras,
as in figure 5.




Fig. 5. The Harmonic Epipolar Constraint operates on eight image lines 



Fact 5. If we have eight cameras which measure eight lines $\hat{\ell}_i$ on a
doubly textured plane, and the first four lines measure one texture with indices
$n_i$, $i \in [1..4]$, and the last four lines measure the other texture with indices $n_i$,
$i \in [5..8]$, then we may form the following constraint:

0 — ^ ^ X I ’ "^*2 (32) 

[ii..ig]esP'*^ 

where $sP^+$ indicates the even permutations among the first four and the last four
indices, plus switching the first and last four sets of indices wholesale.

5 Applications 

While these constraints seem strange, they are also useful and can solve problems
in computer vision not accessible with current methods. We present a few
examples here.

5.1 Rotation from Oriented Textures

If we have three cameras, the rotational constraints can be used even if we can’t 
find vanishing points, or even lines. If we use a simple autocorrelation metric on 
projected textures of wood as in figure 6, we can get a line direction, which is 
enough to input into our prismatic line constraint if we have three cameras. 




Fig. 6. A simple autocorrelation can measure orientation for use with the harmonic 
directionality constraint 



5.2 Calibration 

With the advent of the use of many cameras, the problem of calibrating them 
has come to the fore. More and more camera systems whose cameras do not
necessarily share fields of view are being created. For cameras which do not share
fields of view, it is certainly possible to calibrate them rotationally together using 
bundles of parallel lines, and using the prismatic line constraint. This has been 
known. 

However, the new constraints, particularly the harmonic epipolar constraint, 
allow us to calibrate translationally by showing a grid of boxes, with a couple of 
boxes singled out by a different appearance. This allows us to input the known 
line indices into our harmonic epipolar equation which then results in the com- 
putation of translation for sufficiently numerous and different views of the boxes 
in the plane. 



5.3 Textures and Correspondence 

One of the more interesting applications for these equations lies in the analysis of
the correspondence problem together with our camera geometry. In order to obtain
a deeper insight into the correspondence problem, we need to relax our idea of
correspondence. For if we have corresponding points as input there is clearly no
reason to develop constraints on textures. Often obtaining these correspondences
is difficult, so we break up the problem.

If we have two images, we say that two image regions are in patch correspondence
if they are images of the same planar patch with some uniform
texture property. For instance, in figure 7, the amorphous patches of the two
buildings are in patch correspondence, even though we have not corresponded
individual points. We show how to use this idea as a basis for hard geometry
constraints.





Fig. 7. The amorphous regions are in patch correspondence 



Somehow, the intuitive notion of the wavelength in some signal has to enter
the consideration. If you have a collection of textured planes, and these planes
do not contain any wavelength greater than $\lambda$, then it is clear that if our camera
positions are not known to an accuracy better than $\lambda$, then it is impossible to
compute any sort of correspondence. On the other hand, if we know our camera
positions to within $a \ll \lambda$, and we have many textures with wavelengths greater
than $\lambda$, then it should somehow be possible to match with a high degree of
probability. The smaller a is, the higher the probability that we find a match.

The above intuitive notions strongly suggest that the next step in making 3D 
models is to formulate a feedback mechanism. Bundle adjustment is some form 
of feedback, but it doesn’t utilize any new measurements. Somehow, the feedback 
mechanism should work in a way that better measurements are introduced in 
the process. 

We can use our new constraints in the following way, with some admittedly
broad assumptions. We assume that we have obtained a reasonable estimate of
the camera calibration and rotation. We also assume that we have knowledge
of our camera positions to within a. If we know that our cameras are looking
at a regular texture, then we can measure the positions of some equally spaced
lines with distance $\lambda \gg a$. Since we know these lines are equally spaced, then
we know that our line indices $n = m\lambda$, for some integer m. So we can use our
approximate camera positions and the knowledge that m is an integer to form a
search over the set of integers for the m that give us the least error. These will
probably be the correct m. We can then use these m to obtain more accurate
translational positions. This feedback loop can continue from larger to smaller
wavelengths.
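
The integer search can be illustrated with a deliberately simplified one-dimensional toy model. This model is an assumption made for illustration and does not use the multilinear constraints themselves. Lines sit at positions $m\lambda$ on an axis, the camera position is known only approximately, and the measured quantity is the offset between a line and the true camera position. All names and numbers are hypothetical.

```python
def best_integer(residual, candidates):
    """Return the candidate integer with the smallest residual."""
    return min(candidates, key=residual)


wavelength = 1.0        # spacing of the texture lines (lambda)
t_approx = 0.35         # approximate camera position, error well below lambda
measured_offset = 3.70  # measured: line position minus true camera position


def residual(m):
    """Mismatch between the offset predicted for index m and the measurement."""
    predicted_offset = m * wavelength - t_approx
    return abs(predicted_offset - measured_offset)


# Search over a window of integers for the index with the least error.
m_best = best_integer(residual, range(-10, 11))

# With the index fixed, the camera position can be re-estimated
# from the measurement alone.
t_refined = m_best * wavelength - measured_offset
```

In this toy setting the search is unambiguous only because the position error is much smaller than the wavelength, which mirrors the assumption $\lambda \gg a$ made in the text.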

In order to use these constraints we need to be able to demonstrate that we 
can find lines within textures. While in this paper we cannot lay out an entire 
theory on finding lines (or peaks in the Fourier domain), we will show a few 
examples where we can address textures using geometric constraints where the 
standard multilinear constraints would have difficulty. 

Obviously there are some textures which actually contain lines, such as in 
figure 8. However, there are other textures for which the lines are not readily 
apparent but in which we may find “virtual lines” corresponding to strong fre- 
quency components, such as in figure 9. We posit that a singly textured plane 
corresponds to a particular peak in frequency space, if such a peak exists. A 
real texture may have many peaks in frequency space, some of which are more 
prominent than others. If we have a few prominent peaks, this means that there 
is a strong regular repetition in the texture. This is just the situation where
standard correspondence methods would have trouble. In this way our method 
complements the standard structure from motion methods. 




Fig. 8. Lines are easy to find in this texture 




Fig. 9. We can still find the same lines in affinely transformed textures 



We showed above two textures for which it is relatively easy to find lines. We 
may apply our multilinear constraints to the lines in these textures. 

Many textures are not as regular as the above textures. In this case, we
need the entire patch to be visible in all cameras for the peaks to correspond to
each other. But if the entire patch is visible from various cameras, we may find
corresponding peaks.

But what if we have textures for which lines are not at all apparent, such 
as the beans in figure 10? This picture still has some frequency peaks, which 
generate the lines shown in that picture. Indeed, even if we affinely transform 
the image of the beans we may still find the same set of lines, as in figure 10. 
This allows us to use our constraints on the position of the plane containing the
beans and the position of the cameras.

Fig. 10. Even with random textures the lines still exist if we have the whole patch

Abstract. This paper presents a Bayesian framework for multi-cue 3D
object tracking of deformable objects. The proposed spatio-temporal ob- 
ject representation involves a set of distinct linear subspace models or 
Dynamic Point Distribution Models (DPDMs), which can deal with both 
continuous and discontinuous appearance changes; the representation is 
learned fully automatically from training data. The representation is en- 
riched with texture information by means of intensity histograms, which 
are compared using the Bhattacharyya coefficient. Direct 3D measure- 
ment is furthermore provided by a stereo system. 

State propagation is achieved by a particle filter which combines the three 
cues shape, texture and depth, in its observation density function. The 
tracking framework integrates an independently operating object detec- 
tion system by means of importance sampling. We illustrate the benefit of 
our integrated multi-cue tracking approach on pedestrian tracking from 
a moving vehicle. 



1 Introduction 

Object tracking is a central theme in computer vision with applications ranging 
from surveillance to intelligent vehicles. We are interested in tracking complex, 
deformable objects through cluttered environments, for those cases when simple 
segmentation techniques, such as background subtraction, are not applicable. 

This paper presents a probabilistic framework for integrated detection and
tracking of non-rigid objects. Detections from an independent source of information
are modeled as a “mixture” of Gaussians and are integrated by two means:
they control initialization and termination by a set of rules, and they serve as an
additional source of information for the active tracks.

To increase robustness, three independent visual cues are considered. Object
shape is used, since it is (quite) independent of the complex illumination conditions
found in real world applications and efficient matching techniques exist
to compare shape templates with images [6]. Texture distributions are modeled
as histograms [14,15], which are particularly suitable for tracking since they are
independent of object shape, invariant to rotation, scale, and translation, and
easy to compute. Finally stereo measurements are integrated into the system.
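
The abstract states that the intensity histograms are compared using the Bhattacharyya coefficient. For two normalized histograms p and q this is $\sum_i \sqrt{p_i q_i}$, which equals 1 for identical distributions and 0 for distributions with no overlap. A minimal sketch follows; the function names and bin counts are illustrative assumptions.

```python
import math


def normalize_histogram(counts):
    """Turn non-negative bin counts into a distribution that sums to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must contain at least one sample")
    return [c / total for c in counts]


def bhattacharyya_coefficient(hist_a, hist_b):
    """Similarity of two histograms with the same number of bins."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    p = normalize_histogram(hist_a)
    q = normalize_histogram(hist_b)
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


same = bhattacharyya_coefficient([1, 1, 2], [1, 1, 2])
disjoint = bhattacharyya_coefficient([1, 0], [0, 1])
partial = bhattacharyya_coefficient([3, 1], [1, 3])
```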

In this work, tracking proceeds directly in 3D-space, which allows a more 
natural incorporation of real-world knowledge (e.g. kinematical properties of 
objects) and simplifies reasoning about occlusion and data association. 

The outline of the paper is as follows: Section 2 reviews previous work. Our 
proposed multi-cue object representation is described in Section 3. It consists of 
two parts; the first deals with the spatio-temporal shape representation and the 
second relates to the texture model. The proposed particle filtering approach 
for multi-cue 3D object tracking is presented in Section 4. It integrates an in- 
dependently operating external detection system. We illustrate our approach in 
Section 5 on the challenging topic of pedestrian tracking from a moving vehicle. 
Finally, we conclude in Section 6.



2 Previous Work 

Bayesian techniques are frequently used for visual tracking. They provide a sound 
mathematical foundation for the derivation of (posterior) probability density 
functions (pdf) in dynamical systems. The evolution of the pdf can in principle 
be calculated recursively by optimal Bayesian filtering. Each iteration involves 
a prediction step based on a dynamical model and a correction step based on a 
measurement model. Analytical solutions for the optimal Bayesian filtering prob- 
lem are known only for certain special cases (e.g. Kalman filtering). For others, 
approximate techniques have been developed, such as extended Kalman [1], particle
[2,4], and “unscented” filters [13]. In particular, particle filters have become
widespread, because of their great ease and flexibility in approximating complex
pdfs, and dealing with a wide range of dynamical and measurement models.
Their multi-modal nature makes them particularly suited for object tracking in
cluttered environments, where uni-modal techniques might get stuck and lose
track.
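
The prediction and correction steps described above can be sketched with a minimal bootstrap particle filter for a one-dimensional state. This is a generic textbook-style illustration, not the tracker proposed in this paper. The random-walk dynamics, the Gaussian measurement model, and all parameter values are assumptions.

```python
import math
import random


def particle_filter_step(particles, measurement, rng,
                         process_std=0.2, measurement_std=0.5):
    """One predict / correct / resample cycle of a bootstrap particle filter."""
    # Prediction: propagate every particle through a random-walk model.
    predicted = [x + rng.gauss(0.0, process_std) for x in particles]

    # Correction: weight each particle by a Gaussian measurement likelihood.
    weights = [
        math.exp(-0.5 * ((measurement - x) / measurement_std) ** 2)
        for x in predicted
    ]
    total = sum(weights)
    if total == 0.0:
        # All likelihoods underflowed; fall back to uniform weights.
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]

    # Posterior mean estimate from the weighted particle set.
    estimate = sum(w * x for w, x in zip(weights, predicted))

    # Resampling: draw a new, equally weighted particle set.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, weights, estimate


rng = random.Random(0)
particles = [rng.uniform(-10.0, 10.0) for _ in range(500)]
for _ in range(10):
    particles, weights, estimate = particle_filter_step(particles, 3.0, rng)
```

Because the particle set can represent several hypotheses at once, the same loop also works when the initial distribution or the likelihood has more than one mode.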

Several extensions have been proposed to the early particle filter techniques, 
e.g. dealing with discrete/continuous state spaces [9,11], multiple target tracking 
[12,15,21], or multiple sources of information [10,17]. The latter has involved 
techniques such as importance sampling [10] or democratic integration [17,19], 
which have been used to combine visual cues such as edge and texture in a particle 
filter framework. Particle filters have furthermore been applied in combination 
with low-level [14,15], high-level [9], exemplar-based [18], or mixed-level [11] 
object representations. 

In terms of representation, compact low-dimensional object parameteriza- 
tions can be obtained by linear subspace techniques, e.g. using shape (PDMs) 
[3,9], or texture [20]. However, these methods have some limitations concern- 
ing the global linearity assumption: nonlinear object deformations have to be 
approximated by linear combinations of the modes of variation. They are not 
the most compact representations for objects undergoing complex (non-linear) 
deformations, nor do they tend to be very specific, since implausible shapes can 
be generated, when invalid combinations of the global modes are used. 





Fig. 1. Feature spaces: Linear and locally-linear feature spaces 



Our approach, discussed in the next sections, builds upon the locally linear 
shape representation of [9] (see Figure 1). We extend this by a spatio-temporal 
shape representation, which does not utilize a common object parameterization 
for all possible shapes. Instead, a set of unconnected local parameterizations 
is used, which correspond to clusters of similar shapes. This allows our spatio- 
temporal shape representation to be fully automatically learned from training 
sequences of closed contours, without requiring prior feature correspondence. 

We model texture by means of histograms similar to [14,15]. However, we 
do not rely on circular/rectangular region primitives, but take advantage of the 
detailed shape information to derive appropriate object masks for texture ex- 
traction. Furthermore, unlike previous work, we derive a 3D tracking framework 
also incorporating stereo measurements for added robustness. 

Finally, our tracking framework integrates an independently operating object 
detection system by means of importance sampling. 



3 Multi-cue Object Representation 

3.1 Spatio-temporal Shape Representation 

Dynamic point distribution models capture object appearance by a set of lin- 
ear subspace models with temporal transition probabilities between them. This 
spatio-temporal shape representation can be learned automatically from example 
sequences of closed contours. See Figure 2. Three successive steps are involved. 

Integrated registration and clustering: First, an integrated registration 
and clustering approach [8] is performed. The integration is motivated by the 
fact that general automatic registration methods are not able to find the 
physically correct point correspondences if the variance in object appearance 
is too high. This is in particular the case for self-occluding objects, when not 
all object parts are visible at all times. Our proposed approach therefore does 
not try to register all shapes into a common feature space prior to clustering. 
Instead, the clustering is based on a similarity measure derived from the 
registration procedure, namely the average distance between corresponding 
points after alignment. Shapes fall into the same cluster, and the registration 
is assumed valid, only if this distance is lower than a user-defined threshold. 
For details, the reader is referred to [8]. 

Fig. 2. Learning Dynamic Point Distribution Models 

Linear subspace decomposition: A principal component analysis is applied 
in each cluster of registered shapes to obtain compact shape parameterizations 
known as “Point Distribution Models” (PDMs) [3]. From the $N^c$ shape 
vectors of cluster $c$, given by their u- and v-coordinates 

$$ \mathbf{s}^c_i = (u^c_{i,1}, v^c_{i,1}, u^c_{i,2}, v^c_{i,2}, \ldots, u^c_{i,n^c}, v^c_{i,n^c}), \quad i \in \{1, 2, \ldots, N^c\}, \tag{1} $$

the mean shape $\bar{\mathbf{s}}^c$ and the covariance matrix $K^c$ are derived. Solving the eigensystem 
$K^c \mathbf{e}^c_j = \lambda^c_j \mathbf{e}^c_j$, one obtains the $2n^c$ orthonormal eigenvectors, corresponding 
to the “modes of variation”. The $k^c$ most significant “variation vectors” 
$E^c = (\mathbf{e}^c_1, \mathbf{e}^c_2, \ldots, \mathbf{e}^c_{k^c})$, the ones with the highest eigenvalues $\lambda^c_j$, are chosen to cover a 
user-specified proportion of the total variance contained in the cluster. Shapes can 
then be generated from the mean shape plus a weighted combination of the 
variation vectors: 

$$ \mathbf{s}^c = \bar{\mathbf{s}}^c + E^c \mathbf{b}. \tag{2} $$

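To ensure that the generated shapes remain similar to the training set, the 
weight vector $\mathbf{b}$ is constrained to lie in a hyperellipsoid about the subspace 
origin. Therefore $\mathbf{b}$ is scaled so that the weighted distance from the origin is less 
than a user-supplied threshold $M_{\max}$: 

$$ \sum_{l} \frac{b_l^2}{\lambda^c_l} \le M_{\max}^2, \tag{3} $$

where the sum runs over the retained modes of variation. 
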
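
Shape generation and the hyperellipsoid constraint can be sketched as follows. The sketch assumes that the weighted distance is the eigenvalue-normalised (Mahalanobis-type) distance, i.e. the sum of b_l^2 / lambda_l over the retained modes; the function names and the toy numbers are hypothetical.

```python
import math

def constrain_weights(b, eigenvalues, m_max):
    """Scale b so that sum_l b_l^2 / lambda_l <= m_max^2 (assumed distance)."""
    d2 = sum(bl * bl / lam for bl, lam in zip(b, eigenvalues))
    if d2 <= m_max ** 2:
        return list(b)
    scale = m_max / math.sqrt(d2)
    return [bl * scale for bl in b]

def generate_shape(mean_shape, modes, b):
    """s = mean + E b, with `modes` given as a list of eigenvectors (columns of E)."""
    shape = list(mean_shape)
    for weight, mode in zip(b, modes):
        for idx, component in enumerate(mode):
            shape[idx] += weight * component
    return shape

# Toy PDM with two landmarks (u1, v1, u2, v2) and two modes of variation.
mean_shape = [0.0, 0.0, 1.0, 0.0]
modes = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
eigenvalues = [4.0, 1.0]
b = constrain_weights([6.0, 0.0], eigenvalues, m_max=2.0)
shape = generate_shape(mean_shape, modes, b)
```

In this toy case the unconstrained weight 6.0 lies three standard deviations along the first mode, so it is pulled back onto the boundary of the hyperellipsoid before the shape is generated.
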
Markov transition matrix: To capture the temporal sequence of PDMs, 
a discrete Markov model stores the transition probabilities $T_{i,j}$ from cluster $i$ 
to $j$. They are automatically derived from the transition frequencies found in 
the training sequences. An extension for covering more complex temporal events 
(e.g. [5]) is conceivable and straightforward. 
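
Deriving the transition probabilities from transition frequencies amounts to counting consecutive cluster labels and normalising each row. A minimal sketch, assuming each training sequence has already been converted into a list of cluster indices; the uniform fallback for clusters that are never left is an assumption of this sketch, not something stated in the paper.

```python
def estimate_transition_matrix(sequences, num_clusters):
    """Row-normalised transition frequencies T[i][j] from cluster i to cluster j."""
    counts = [[0] * num_clusters for _ in range(num_clusters)]
    for labels in sequences:
        for i, j in zip(labels, labels[1:]):
            counts[i][j] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # No outgoing transition observed: fall back to a uniform row (assumption).
            matrix.append([1.0 / num_clusters] * num_clusters)
        else:
            matrix.append([c / total for c in row])
    return matrix

# Two hypothetical label sequences over two clusters.
T = estimate_transition_matrix([[0, 0, 1, 1, 1, 0], [1, 1, 0, 0]], 2)
```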



3.2 Modeling Texture Distributions 

The texture distribution over a region $R = (u_1, v_1, u_2, v_2, \ldots, u_{n_R}, v_{n_R})$, given by 
its $n_R$ u- and v-coordinates, is represented by a histogram $\theta_R = \{\theta^r_R\}_{r=1,\ldots,m}$, 
which is divided into $m$ bins. It is calculated as follows: 

$$ \theta^r_R = \frac{1}{n_R} \sum_{i=1}^{n_R} \delta(h(u_i, v_i) - r), \tag{4} $$

where $h(u_i, v_i)$ assigns one of the $m$ bins to the grey value at location $(u_i, v_i)$ 
and $\delta$ is the Kronecker delta function. 
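
The histogram of Equation 4 can be computed by visiting every pixel of the region once. The sketch below assumes 8-bit grey values and a uniform binning function; both choices, like the function name, are illustrative and not taken from the paper.

```python
def texture_histogram(image, region, num_bins, max_grey=256):
    """Normalised grey-value histogram over a region.

    image  : 2D list of grey values, indexed image[v][u]
    region : list of (u, v) pixel coordinates belonging to the object mask
    """
    hist = [0.0] * num_bins
    for u, v in region:
        r = image[v][u] * num_bins // max_grey  # assumed uniform bin assignment h(u, v)
        hist[r] += 1.0
    n = len(region)
    return [count / n for count in hist]

image = [[0, 64],
         [128, 255]]
region = [(0, 0), (1, 0), (0, 1), (1, 1)]
theta = texture_histogram(image, region, num_bins=4)
```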

To measure the similarity of two distributions $\theta_1 = \{\theta^r_1\}_{r=1,\ldots,m}$ and 
$\theta_2 = \{\theta^r_2\}_{r=1,\ldots,m}$, we selected (among various possibilities [16]) the Bhattacharyya 
coefficient, which has proved valuable in combination with tracking [14,15]: 

$$ \rho(\theta_1, \theta_2) = \sum_{r=1}^{m} \sqrt{\theta^r_1 \theta^r_2}. \tag{5} $$

$\rho(\theta_1, \theta_2)$ ranges from 0 to 1, with 1 indicating a perfect match. The Bhattacharyya 
distance $d(\theta_1, \theta_2) = \sqrt{1 - \rho(\theta_1, \theta_2)}$ can easily be calculated from the 
coefficient. 
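
Both quantities are straightforward to compute for normalised histograms. A minimal sketch (the function names are hypothetical; the clamp at zero only guards against rounding error):

```python
import math

def bhattacharyya_coefficient(h1, h2):
    """Sum over bins of sqrt(h1[r] * h2[r]) for two normalised histograms."""
    return sum(math.sqrt(a * b) for a, b in zip(h1, h2))

def bhattacharyya_distance(h1, h2):
    """sqrt(1 - coefficient), clamped at zero against tiny negative rounding errors."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(h1, h2)))

same = bhattacharyya_distance([0.5, 0.5], [0.5, 0.5])      # identical histograms
disjoint = bhattacharyya_distance([1.0, 0.0], [0.0, 1.0])  # no overlapping bins
```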

For tracking, a reference distribution $\theta$ is calculated at track initialization 
and updated over time to compensate for small texture changes. As in [14], 
the update is done with the mean histogram $\bar{\theta}$ observed under the shape of all 
particles: 

$$ \theta^r_{t+1} = \alpha\,\theta^r_t + (1 - \alpha)\,\bar{\theta}^r. \tag{6} $$

The user-specified parameter $\alpha$ controls the contribution of the previous reference 
and the observed mean histograms. 



4 Bayesian Tracking 

In this work particle filtering is applied to approximate optimal Bayesian tracking 
[2,4] for a single target. The state vector $\mathbf{x} = (\Pi, \Sigma)$ of a particle comprises the 
position and velocity $\Pi = (x, y, z, v_x, v_y, v_z)$ in three-dimensional space (a fixed 
object size is assumed) and the shape parameters $\Sigma = (c, \mathbf{b})$ introduced in 
Section 3.1. For tracking, the dynamics $p(\mathbf{x}_{k+1} \mid \mathbf{x}_k = \mathbf{s}^i_k)$ and the conditional 
density $p(\mathbf{z}_k \mid \mathbf{x}_k = \mathbf{s}^i_k)$ have to be specified, where $\mathbf{s}^i_k$ is the $i$-th sample at time 
$k$. 



4.1 Dynamics 

Object dynamics are assumed independent for the two components $\Pi$ and $\Sigma$ of 
our state vector and are defined separately as follows. 

During each sampling period the position and velocity vector $\Pi$ is assumed 
to evolve according to the following dynamic equation: 

$$ \Pi_{k+1} = \begin{pmatrix}
1 & 0 & 0 & T_k & 0 & 0 \\
0 & 1 & 0 & 0 & T_k & 0 \\
0 & 0 & 1 & 0 & 0 & T_k \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix} \Pi_k + \Gamma_k \nu_k, \tag{7} $$

where $\nu_k$ is the user-defined process noise, which has to be chosen to account for 
velocity changes during each sampling interval $T_k$, and $\Gamma_k$ is the time-dependent 
noise gain [1,2]. 
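
The constant-velocity model can be sketched per particle as follows. The paper does not spell out the noise gain; the sketch assumes the common white-noise-acceleration form, in which a random acceleration enters the position with factor dt^2/2 and the velocity with factor dt. The function name and parameters are hypothetical.

```python
import random

def propagate(state, dt, accel_std, rng):
    """Advance (x, y, z, vx, vy, vz) by one sampling interval dt."""
    x, y, z, vx, vy, vz = state
    # One random acceleration per axis; the assumed gain (dt^2/2, dt) maps it
    # onto position and velocity respectively.
    ax, ay, az = (rng.gauss(0.0, accel_std) for _ in range(3))
    return (x + dt * vx + 0.5 * dt * dt * ax,
            y + dt * vy + 0.5 * dt * dt * ay,
            z + dt * vz + 0.5 * dt * dt * az,
            vx + dt * ax,
            vy + dt * ay,
            vz + dt * az)

rng = random.Random(0)
# Object 10 m ahead, moving laterally at 1 m/s and approaching at 5 m/s.
noise_free = propagate((0.0, 0.0, 10.0, 1.0, 0.0, -5.0), 0.1, 0.0, rng)
noisy = propagate((0.0, 0.0, 10.0, 1.0, 0.0, -5.0), 0.1, 2.0, rng)
```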

The shape component $\Sigma = (c, \mathbf{b})$ is composed of a discrete parameter $c_k$ 
modeling the cluster membership and the continuous-valued weight vector $\mathbf{b}_k$. To 
deal with this “mixed” state, the dynamics are decomposed as follows: 

$$ p(\Sigma_{k+1} \mid \Sigma_k) = p(\mathbf{b}_{k+1} \mid c_{k+1}, \Sigma_k)\, p(c_{k+1} \mid \Sigma_k). \tag{8} $$

Assuming that the transition probabilities $T_{i,j}$ of our discrete Markov model are 
independent of the previous weight vector $\mathbf{b}_k$, the second part of Equation 8 
reduces to 

$$ p(c_{k+1} = j \mid c_k = i, \mathbf{b}_k) = T_{i,j}. \tag{9} $$

For the continuous parameters we now have to consider two cases. In the case 
$i = j$, when no PDM transition occurs, we assume 

$$ p(\mathbf{b}_{k+1} \mid c_{k+1} = j, c_k = i, \mathbf{b}_k) = p_{i,j}(\mathbf{b}_{k+1} \mid \mathbf{b}_k) \tag{10} $$

to be a Gaussian random walk. For $i \neq j$ the cluster is switched from $i$ to $j$ and 
the parameters $\mathbf{b}$ are assumed to be normally distributed about the mean shape 
of PDM $j$. 
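
The decomposition suggests a two-step draw for each particle: first the cluster, then the weight vector. A minimal sketch, assuming isotropic Gaussians with hypothetical standard deviations `walk_std` and `init_std` and a per-cluster number of modes `num_modes`; none of these names or values come from the paper.

```python
import random

def sample_shape_state(cluster, b, transition, num_modes, rng,
                       walk_std=0.1, init_std=1.0):
    """Draw (c_{k+1}, b_{k+1}) given (c_k, b_k) for the mixed shape state."""
    # Step 1: draw the next cluster from row `cluster` of the transition matrix.
    candidates = list(range(len(transition)))
    next_cluster = rng.choices(candidates, weights=transition[cluster], k=1)[0]
    if next_cluster == cluster:
        # No PDM transition: Gaussian random walk on the weight vector.
        next_b = [bl + rng.gauss(0.0, walk_std) for bl in b]
    else:
        # PDM transition: weights drawn about the mean shape of the new PDM (b = 0).
        next_b = [rng.gauss(0.0, init_std) for _ in range(num_modes[next_cluster])]
    return next_cluster, next_b

rng = random.Random(0)
stay = sample_shape_state(0, [0.5, -0.2], [[1.0, 0.0], [0.0, 1.0]], [2, 3], rng)
switch = sample_shape_state(0, [0.5, -0.2], [[0.0, 1.0], [1.0, 0.0]], [2, 3], rng)
```

Since the local parameterizations are unconnected, the weight vector of the old cluster has no meaning in the new one, which is why it is redrawn rather than carried over.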

4.2 Multi-cue Observation 

Three cues contribute to the particle weights in this work: shape, texture, and 
stereo. Their distributions are assumed conditionally independent, so that 

$$ p(\mathbf{z}_k \mid \mathbf{x}_k = \mathbf{s}^i_k) = p_{shape}(\mathbf{z}_k \mid \mathbf{x}_k = \mathbf{s}^i_k)\, p_{texture}(\mathbf{z}_k \mid \mathbf{x}_k = \mathbf{s}^i_k)\, p_{stereo}(\mathbf{z}_k \mid \mathbf{x}_k = \mathbf{s}^i_k). \tag{11} $$

Since the shape and texture similarity measures between the prediction and 
observation are defined in the image plane, the shape of each particle is generated 
using Equation 2. Its centroid coordinates $u, v$ and the scale factor $s$ are derived 
from the 3D coordinates $x, y, z$ and the specified 3D object dimensions, using a 
simple pinhole camera model with known (intrinsic/extrinsic) parameters. 
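
The projection step can be sketched with an ideal pinhole model. The sketch assumes that the 3D point is already expressed in the camera frame (x to the right, y downwards, z along the optical axis), i.e. that the extrinsic transformation has been applied, and that the focal length is given in pixels. All parameter names and values are hypothetical.

```python
def project(point, focal_px, principal_point, object_height_m):
    """Project a camera-frame 3D point to image coordinates and object height in pixels."""
    x, y, z = point
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera")
    u = principal_point[0] + focal_px * x / z
    v = principal_point[1] + focal_px * y / z
    # The image scale of an object of fixed metric size shrinks with 1 / z.
    height_px = focal_px * object_height_m / z
    return u, v, height_px

# Object 1 m to the right and 10 m ahead, assumed to be 1.8 m tall.
u, v, height_px = project((1.0, 0.0, 10.0), 500.0, (128.0, 98.0), 1.8)
```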

Shape: A method based on multi-feature distance transforms [6] is applied 
to measure the similarity between the predicted shapes and the observed image 
edges. It takes into account the position and direction of edge elements. Formally, 
if the image $I$ is observed at time $k$ and $S$ is the shape of particle $\mathbf{s}^i_k$, we define 

$$ p_{shape}(\mathbf{z}_k \mid \mathbf{x}_k = \mathbf{s}^i_k) \propto \exp\Big(-\alpha_{shape} \Big(\frac{1}{|S|} \sum_{s \in S} D_I(s)\Big)^2\Big), \tag{12} $$

where $|S|$ denotes the number of features $s$ in $S$, $D_I(s)$ is the distance of the 
closest feature in $I$ to $s$, and $\alpha_{shape}$ is a user-specified weight. 

Texture: For texture, $p_{texture}(\mathbf{z}_k \mid \mathbf{x}_k = \mathbf{s}^i_k)$ is defined as 

$$ p_{texture}(\mathbf{z}_k \mid \mathbf{x}_k = \mathbf{s}^i_k) \propto \exp(-\alpha_{texture}\, d^2(\omega, \theta)), \tag{13} $$

where $d(\omega, \theta)$ is the Bhattacharyya distance described in Section 3.2 between 
the reference distribution $\theta$ and the observed texture distribution $\omega$ under the 
shape of particle $\mathbf{s}^i_k$. As above, $\alpha_{texture}$ is a user-defined weighting factor. 

Stereo: A stereo vision module generates a depth image $I_{depth}$, which contains 
the distance to certain feature points. To measure the depth $d_{stereo}$ of particle 
$\mathbf{s}^i_k$, the distances of the feature points under its shape are averaged. Given 
the predicted distance $z$ of $\mathbf{s}^i_k$ and the measurement $d_{stereo}$, we define 

$$ p_{stereo}(\mathbf{z}_k \mid \mathbf{x}_k = \mathbf{s}^i_k) \propto \exp(-\alpha_{stereo} (d_{stereo} - z)^2), \tag{14} $$

where $\alpha_{stereo}$ is a weighting factor. 
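
Under the conditional independence assumption the three cue likelihoods simply multiply into one unnormalised particle weight. The sketch below assumes that each cue enters through a squared distance in the exponent and uses arbitrary illustrative weighting factors; the function name and default values are hypothetical.

```python
import math

def particle_weight(mean_edge_distance, texture_distance,
                    stereo_depth, predicted_depth,
                    a_shape=0.1, a_texture=5.0, a_stereo=0.5):
    """Unnormalised weight of one particle from shape, texture and stereo cues."""
    p_shape = math.exp(-a_shape * mean_edge_distance ** 2)
    p_texture = math.exp(-a_texture * texture_distance ** 2)
    p_stereo = math.exp(-a_stereo * (stereo_depth - predicted_depth) ** 2)
    # Conditional independence: the joint likelihood is the product of the cues.
    return p_shape * p_texture * p_stereo

perfect = particle_weight(0.0, 0.0, 10.0, 10.0)
good = particle_weight(1.0, 0.3, 12.0, 10.0)
poor = particle_weight(2.0, 0.3, 12.0, 10.0)
```

The weights only need to be correct up to a common factor, since they are normalised over the particle set afterwards.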



4.3 Integrated Detection and Tracking 

A set of particle filters is used to track multiple objects in this work. Each is in 
one of the states active or inactive. An active track is either visible or hidden. 

A detection system provides possible object locations, which are modeled as a 
“mixture” of Gaussians in which one component corresponds to one detection. 
The mixture is exploited in two ways: as an importance function for the particle 
filters and for the initialization and termination of tracks. 

The following rules, which depend on the actual detections, observations, and 
geometric constraints, control the evolution of tracks. 

A track is initialized if no mean state of an active track is in the $3\sigma$-bound 
of a detection. To suppress spurious measurements it starts hidden. Initialization 
involves drawing the 3D position and velocity of the particles according to 
the Gaussian of the detection. Since no shape information is provided by the 
detection system, the cluster membership and the continuous parameters are 
randomly assigned. After the first particle weighting the reference texture 
distribution is initialized with the mean histogram observed under the shape of all 
particles. 

A track is visible if it has at least $t_1$ associated detections, if the last associated 
detection is at most $t_2$ time steps old, and if the actual match values were 
better than user-defined thresholds for the last $t_3$ time steps. Otherwise the track is 
hidden. 

A track becomes inactive if the prediction falls outside the detection area or 
image, if the actual match values were worse than user-specified thresholds for 
$t_4$ successive time steps, or if a second track is tracking the same object. 

4.4 Sampling 

For particle filtering an algorithm based on ICONDENSATION [10] is applied. It 
integrates the standard factored sampling technique of CONDENSATION [2], importance 
sampling, and sampling from a “reinitialization” prior probability density. 

This allows us to integrate the mixture of Gaussians from the detection system 
as an importance function into the tracking framework. As in [10], it is also 
used as a reinitialization prior, which gives us the possibility to draw samples 
independently of the past history using the Gaussian of the nearest detection. 
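
The three sampling modes can be sketched for a scalar state as follows. This is an illustration in the spirit of [10], not the authors' implementation: the mode probabilities `q` and `r`, the Gaussian random-walk dynamics and the single-Gaussian detection are all assumptions. The returned correction factor has to be multiplied by the observation likelihood to obtain the particle weight.

```python
import math
import random

def gauss_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def draw_sample(prev_samples, prev_weights, detection, rng,
                q=0.1, r=0.4, process_std=0.5):
    """Return (sample, correction factor); detection = (mean, std) of a Gaussian."""
    det_mean, det_std = detection
    alpha = rng.random()
    if alpha < q:
        # Reinitialisation: ignore the past and draw from the detection Gaussian.
        return rng.gauss(det_mean, det_std), 1.0
    if alpha < q + r:
        # Importance sampling: draw from g and correct by f / g, where f is the
        # prediction density built from the previous weighted sample set.
        x = rng.gauss(det_mean, det_std)
        f = sum(w * gauss_pdf(x, s, process_std)
                for s, w in zip(prev_samples, prev_weights))
        return x, f / gauss_pdf(x, det_mean, det_std)
    # Standard factored sampling: resample, then apply the dynamics.
    base = rng.choices(prev_samples, weights=prev_weights, k=1)[0]
    return base + rng.gauss(0.0, process_std), 1.0

rng = random.Random(1)
prev_samples, prev_weights = [0.0, 1.0, 2.0], [0.2, 0.5, 0.3]
new_set = [draw_sample(prev_samples, prev_weights, (1.0, 0.5), rng) for _ in range(100)]
```

The correction factor keeps the sample set an unbiased representation of the posterior even though the importance function concentrates samples near the detections.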

As in [9,11], a two-step sampling approach is followed for the decomposed 
dynamics of the mixed discrete/continuous shape space. First the cluster of our 
shape model is determined using the transition probabilities $T_{i,j}$, and afterwards 
the weight vector $\mathbf{b}$ is predicted according to the Gaussian assumptions described 
in Section 4.1. 

5 Experiments 

To evaluate our framework we performed experiments on pedestrian tracking 
from a moving vehicle. The dynamic shape model, outlined in Section 3.1, was 
trained from approximately 2500 pedestrian shapes of our training set. The 
resulting cluster prototypes and the temporal transition probabilities between 
the associated PDMs are illustrated in Figure 3. As expected (and desired), the 
diagonal elements of the transition matrix contain high values, so that there is 
always a high probability of staying in the same cluster during tracking. Figure 
4 shows three random trajectories generated with the proposed dynamic shape 
model, assuming that a camera is moving at 5 m/s towards the object in 3D 
space, which is moving laterally at 1 m/s. Each greyscale change corresponds to 
a PDM transition. 

Fig. 3. Cluster transition matrix: The squares represent the transition probabilities 
from a PDM of column j to row i. 

Pedestrian detection is performed by the Chamfer System in the experiments. 
It localizes objects according to their shape in a coarse-to-fine approach over a 
template hierarchy by correlating with distance-transformed images. For details 
the reader is referred to [7]. The 3D position is derived by backprojecting the 
2D shape with our camera model, assuming that the object is standing on the 
ground. 

Fig. 4. Predicting shape changes using Dynamic Point Distribution Models: Three 
random trajectories assuming that the camera is moving at 5 m/s towards the object, which 
is moving laterally at 1 m/s. Each greyscale change corresponds to a PDM transition. 

The tracking system was tested on a 2.4GHz standard workstation and needs 
about 300ms/frame for an active track and an image resolution of 256 x 196. 
The number of particles is set to 500 in the experiments. 

During tracking, the a-priori and a-posteriori probability of each PDM can 
be observed online, as shown in Figure 5. The sizes of the dark and light grey 
boxes indicate the a-priori and a-posteriori probability, respectively. The more 
similar they are, the better the prediction. 

Fig. 5. Tracking results: The dark box indicates the a-priori and the light box the 
a-posteriori confidence of each cluster. The larger the box, the higher the probability. 

Fig. 6. Tracking results: In the left images the best sample is shown for each track. 
In addition the detections are illustrated as boxes. The shapes of all particles, which 
approximate the posterior pdf, are drawn in the middle. Finally, a top view of the scene 
can be viewed on the right. 

Table 1. Average distances of the true and estimated pedestrian locations for a 
sequence of 34 images. 

cues                       distance   lateral error   error in depth 
edge                       1.37 m     0.085 m         1.34 m 
edge + texture             1.28 m     0.14 m          1.25 m 
edge + texture + stereo    1.05 m     0.11 m          1.03 m 



Results of the overall system are given in Figure 6 for urban, rural, and 
synthetic environments. In the original images (left column) the best sample 
of each active track is shown. Whenever detections are observed, they are also 
represented there as grey boxes. The shapes of all particles, which approximate 
the posterior pdf, are drawn in the middle column. Finally, a top view of the 
scene can be viewed on the right. The past trajectories are represented by small 
circles while the current position estimate is marked by big circles. The text 
contains the actual distance and velocity estimates. 

To substantiate the visually observable improvement due to the integration of 
shape, texture, and stereo information, the position estimates of the system were 
compared against ground truth. Table 1 shows the results for the first sequence 
of Figure 6, which consists of 34 images. As expected, the average error in depth 
is higher than the lateral error, and an improvement in performance due to the 
integration of multiple cues can be observed. 

6 Conclusions 

This paper presented a general Bayesian framework for multi-cue 3D deformable 
object tracking. A method for learning spatio-temporal shape representations 
from examples was outlined, which can deal with both continuous and discontinuous 
appearance changes. Texture histograms and direct 3D measurements 
were integrated to improve the robustness and versatility of the framework. We 
showed how measurements from an independently operating detection 
system can be integrated into the tracking approach by means of importance 
sampling. Experiments show that the proposed framework is suitable for tracking 
pedestrians from a moving vehicle and that the integration of multiple cues 
can improve the tracking performance. 

Abstract. Classifying materials from their appearance is a challenging problem, 
especially if illumination and pose conditions are permitted to change: highlights 
and shadows caused by 3D structure can radically alter a sample’s visual texture. 
Despite these difficulties, researchers have demonstrated impressive results on the 
CUReT database which contains many images of 61 materials under different 
conditions. A first contribution of this paper is to further advance the 
state-of-the-art by applying Support Vector Machines to this problem. To our knowledge, we 
record the best results to date on the CUReT database. 

In our work we additionally investigate the effect of scale, since robustness to viewing 
distance and zoom settings is crucial in many real-world situations. Indeed, a 
material’s appearance can vary considerably as fine-level detail becomes visible 
or disappears as the camera moves towards or away from the subject. We handle 
scale variations using a pure-learning approach, incorporating samples imaged at 
different distances into the training set. An empirical investigation is conducted to 
show how the classification accuracy decreases as less scale information is made 
available during training. 

Since the CUReT database contains little scale variation, we introduce a new 
database which images ten CUReT materials at different distances, while also 
maintaining some change in pose and illumination. The first aim of the database is 
thus to provide scale variations, but a second and equally important objective is to 
attempt to recognise different samples of the CUReT materials. For instance, does 
training on the CUReT database enable recognition of another piece of sandpaper? 
The results clearly demonstrate that it is not possible to do so with any acceptable 
degree of accuracy. Thus we conclude that impressive results, even on a 
well-designed database such as CUReT, do not imply that material classification is 
close to being a solved problem under real-world conditions. 



1 Introduction 

The recognition of materials from their visual texture has many applications; for instance, 
it facilitates image retrieval and object recognition. As a step towards the use of such 
techniques in the real world, recent developments have concentrated on being able to 
recognise materials from a variety of poses and with different illumination conditions 
[16,9,31]. This is a particularly challenging task when the material has considerable 
3-dimensional structure. With such 3D textures, cast shadows and highlights can cause the 
appearance to change radically with different viewing angles and illumination conditions. 
An example from the CUReT database [10] (white bread) is given in Fig. 1. 

Fig. 1. Three images of white bread taken from the CUReT database demonstrating the variation 
of appearance of a 3D texture as the pose and illumination conditions change. 

The overall goal of our work is to bring material recognition algorithms closer still 
to the stage where they will be useful in real-world applications. Thus a major objective 
is providing robustness to variations in scale. Experiments will show that failure in 
this regard rapidly leads to a deterioration in classification accuracy. Our solution is a 
pure-learning approach which accommodates variations in scale in the training samples, 
similar to how differing illumination and pose are modelled. 

A further contribution concerns demonstrating the suitability of Support Vector Ma- 
chines (SVMs) [8,29] as classifiers in this recognition problem. Experiments show 
that the SVM classifier systematically outperforms the nearest-neighbour classifica- 
tion scheme adopted by Varma and Zisserman [31] with which we compare our results, 
and we also demonstrate that we achieve an improvement on their Markov Random 
Field (MRF) approach [32] which, to our knowledge, previously yielded the best overall 
classification rate on the CUReT database. 

As already alluded to, experiments are conducted on the CUReT image database [10] 
which captures variations in illumination and pose for 61 different materials, many of 
which contain significant 3D structure. This database does not, however, contain many 
scaling effects. Some indication of the performance under varying scale can be achieved 
by artificially scaling the images by modifying the scales of the filters in the filter bank. 
However, we also investigate classification results on pictures of materials present in 
the CUReT database, imaged in our laboratory. The objectives of these experiments are 
two-fold. First, they permit a systematic study of scale effects while still providing some 
variations in pose and illumination. Second, we investigate whether it is possible to 
recognise materials in this new database given models trained on the CUReT database. 
This indeed proves a stern test, since the sample of material, the camera and the lighting 
conditions all differ from those used during training. 

Thus the final contribution of this paper is the construction of a new database, 
designed to complement the CUReT database with scale variations. This database, called 
KTH-TIPS (Textures under varying Illumination, Pose and Scale), is freely available to 
other researchers via the web [12]. 

The remainder of the paper is organised as follows. Section 2 reviews previous 
literature in the field. Particular emphasis is placed on the algorithm of Varma and 
Zisserman [31], on which we ourselves build to a large extent. Section 3 discusses the 
application of Support Vector Machines to this problem, and also presents experiments 
which demonstrate their superior performance relative to the original approach of [31]. 
Further experiments in the paper also make use of SVMs. Then, Section 4 discusses issues 
with scale, presents a pure learning approach for tackling the problem, and conducts 
experiments on the CUReT database. Section 5 introduces the new database designed 
to supplement the CUReT database for experiments with scale. Conclusions are drawn 
and potential avenues for future research outlined in Section 6. 

2 Previous Work 

Most work on texture recognition [21,23,14] has dealt with planar image patches sam- 
pled, for instance, from the Brodatz collection [4]. The training and test sets typically 
consist of non-overlapping patches taken from the same images. More recently, however, 
researchers have started to combat the problems associated with recognising materials 
in spite of varying pose and illumination. Leung and Malik [16] modelled 3D materi- 
als in terms of texton histograms. The notion of textons is familiar from the work of 
Julesz [13], but it was only recently defined for greyscale images as a cluster centre in 
a feature space formed by the output of a filter bank. Given a vocabulary of textons, the 
filter output of each pixel is assigned to its nearest texton, and a histogram of textons is 
formed over an extended image patch. This procedure was described for 2D textures in 
[20] and for 3D textures in [16] by stacking geometrically registered images from the 
training set. Recognition is achieved by gathering multiple images of the material from 
the same viewpoints and illuminations, performing the geometric registration, comput- 
ing the texton histogram and classifying it using a nearest-neighbour scheme based on 
the distance between model and query histograms. 
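
The texton pipeline described above can be sketched in a few lines: every filter-response vector is assigned to its nearest texton, the assignments are accumulated into a normalised histogram, and a query histogram is classified by its nearest model. The chi-square histogram distance and all names below are assumptions of this sketch; the text only states that a distance between model and query histograms is used.

```python
def nearest_texton(response, textons):
    """Index of the texton (cluster centre) closest to one filter-response vector."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(range(len(textons)), key=lambda k: sq_dist(response, textons[k]))

def texton_histogram(responses, textons):
    """Normalised histogram of texton assignments over an image patch."""
    hist = [0.0] * len(textons)
    for response in responses:
        hist[nearest_texton(response, textons)] += 1.0
    return [h / len(responses) for h in hist]

def classify(query_hist, models):
    """Nearest-neighbour label; models is a list of (label, histogram) pairs."""
    def chi2(h1, h2):
        return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)
    return min(models, key=lambda m: chi2(query_hist, m[1]))[0]

# Toy example with a two-texton vocabulary in a 2D feature space.
textons = [[0.0, 0.0], [1.0, 1.0]]
responses = [[0.1, 0.0], [0.9, 1.0], [1.0, 1.2], [0.0, 0.2]]
hist = texton_histogram(responses, textons)
label = classify(hist, [("material_a", [1.0, 0.0]), ("material_b", [0.4, 0.6])])
```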

Cula and Dana [9] adapted the method of Leung and Malik to form a faster, simpler 
and more accurate classifier. They realised that the 3D registration was not necessary, and 
instead described a material by multiple histograms of 2D textons, where each histogram 
is obtained from a single image in the training set. This also implies that recognition is 
possible from a single query image. 

Varma and Zisserman [31] argued strongly for a rotationally invariant filter bank. 
First, two images of the same material differing only by an image-plane rotation should 
be equivalent. Second, removing the orientation information in the filter bank consid- 
erably reduced the size of the feature vector. Third, it led to a more compact texton 
vocabulary since it was no longer necessary for one texton to be a rotated version of 
another. Rotational invariance was achieved by storing only the maximum response over 
orientation of a given type of filter at a given scale. As Fig. 2 indicates, the filter bank con- 
tains 38 filters, but only 8 responses are stored, yielding the so-called MR8 (Maximum 
Response 8) descriptor. Not only did the use of this descriptor reduce storage require- 
ments and computation times, an improvement in recognition rate was also achieved. In 
their experiments [31] they use 92 of the 205 images in the CUReT database, removing 
samples at severely slanted poses. Splitting these 92 images of each material equally 
into 46 images for training and 46 images for the test set, they obtain an impressive 
classification accuracy of up to 97.43% [32]. This is the system that we will be using as 
a reference in our own experiments. 
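
The reduction from 38 filter responses to the 8-dimensional descriptor can be sketched for a single pixel as follows. The data layout (a dictionary with 3 x 6 response tables for the edge and bar filters) and the example values are hypothetical.

```python
def mr8(responses):
    """Collapse the responses at one pixel to the 8 maximum-response values.

    responses["edge"] and responses["bar"] are [scale][orientation] lists
    (3 scales x 6 orientations); "gaussian" and "laplacian" are scalars.
    """
    descriptor = []
    for kind in ("edge", "bar"):
        for per_orientation in responses[kind]:
            # Keep only the strongest response over the 6 orientations.
            descriptor.append(max(per_orientation))
    descriptor.append(responses["gaussian"])
    descriptor.append(responses["laplacian"])
    return descriptor

example = {
    "edge": [[0.1, 0.7, 0.3, 0.0, 0.2, 0.1],
             [0.4, 0.2, 0.9, 0.1, 0.0, 0.3],
             [0.0, 0.1, 0.2, 0.5, 0.3, 0.2]],
    "bar": [[0.2, 0.1, 0.0, 0.3, 0.8, 0.1],
            [0.5, 0.5, 0.1, 0.0, 0.2, 0.4],
            [0.1, 0.0, 0.6, 0.2, 0.1, 0.3]],
    "gaussian": 0.6,
    "laplacian": -0.1,
}
descriptor = mr8(example)
num_filters = 2 * 3 * 6 + 2  # edge and bar filters plus Gaussian and Laplacian
```

Rotating the image patch permutes the entries within each row of the tables but leaves the row maxima, and hence the descriptor, unchanged (up to the discretisation into 6 orientations).
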

Fig. 2. Following [31] we use a filter bank consisting of edge and bar filters (first and second 
Gaussian derivatives) at 3 scales and 6 orientations, and also a Gaussian and Laplacian. Only the 
maximum response over orientations is stored for each filter type and scale, yielding the 8-dimensional MR8 descriptor. 



Many different descriptors have been proposed for texture discrimination. Filter 
banks are indeed very popular [21,16,9,31,24], and there is evidence that biological 
systems process visual stimuli using filters resembling those in Fig. 2. However, 
non-filter descriptors have recently been regaining popularity [11,32,19,15]. Varma and Zisserman [32] present 
state-of-the-art results on the CUReT database using a Markov Random Field (MRF) 
model. Mäenpää and Pietikäinen [19] extend the Local Binary Pattern approach [23] to 
multiple image resolutions and obtain near-perfect results on a test set from the Outex 
database. However, this database does not contain any variations in pose or illumination, 
and the variation in scale is rather small (100 dpi images in the training set and 
120 dpi images in the test set). Recent, impressive work by Lazebnik et al. [15] considers 
simultaneous segmentation and classification of textures under varying scale. Interest 
points are detected, normalised for scale [18], skew and orientation, and intensity domain 
spin images computed as descriptors. Each interest point is assigned to a texture class 
before a relaxation scheme is used to smooth the response. It remains to be seen, how- 
ever, whether this scheme can handle large variations in illumination, and the number 
of classes in their experiments is rather small. Scale-invariant recognition using Gabor 
filters on Brodatz textures was considered by Manthalkar et al. [22]. 



3 Using Support Vector Machines for Texture Classification 

The first contribution of this paper is to demonstrate that recent advances in machine 
learning prove fruitful in material classification. Support Vector Machines are 
state-of-the-art large margin classifiers which have gained popularity within visual pattern 
recognition, particularly for object recognition. Pontil and Verri [26] demonstrated the 
robustness of SVMs to noise, bias in the registration and moderate amounts of occlusion 
while Roobaert et al. [27] examined their generalisation capabilities when trained on 
only a few views per object. Barla et al. [2] proposed a new class of kernel inspired 
by similarity measures successful in vision applications. Other notable work includes 
[17,5,1]. Although SVMs have previously been used on planar textures [14], they have 
not, to our knowledge, been applied to 3D material classification under varying imaging 
conditions. 

Before demonstrating in experiments the improvements that can be achieved with 
SVMs, we provide a brief review of the theory behind this type of algorithm. For a more 
detailed treatment, we refer to [8,29]. 



3.1 Support Vector Machines: A Review 

Consider the problem of separating a set of training data (a;i, j/i), {x 2 ,i/ 2 )---{xm, Um), 
where Xi G is a feature vector and G {— 1,+1} its class label. If we assume 
that the two classes can be separated by a hyperplane w ■ x + b = 0, and that we have 
no prior knowledge about the data distribution, then the optimal hyperplane (the one 
with the lowest bound on the expected generalisation error) is that which has maximum 
distance to the closest points in the training set. The optimal values for w and b can be 
found by solving the following constrained minimisation problem: 

minimise - II subjectto yi{w ■ Xi + b) > l,\/i = 1, . . .m (1) 

w,b 2 

Introducing Lagrange multipliers $\alpha_i$ $(i = 1, \ldots, m)$ results in a classification function 

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i y_i \, x_i \cdot x + b \right) \qquad (2)$$

where $\alpha_i$ and $b$ are found by Sequential Minimal Optimisation (SMO, [8,29]). Most of 
the $\alpha_i$ take the value of zero; those $x_i$ with nonzero $\alpha_i$ are the “support vectors”. In 
cases where the two classes are non-separable, the Lagrange multipliers are bounded, 
$0 \le \alpha_i \le C$, $i = 1, \ldots, m$, where $C$ determines the trade-off between margin maximisation 
and training error minimisation. To obtain a nonlinear classifier, one maps the data from 
the input space to a high dimensional feature space $\mathcal{H}$ by $x \to \Phi(x) \in \mathcal{H}$, such 
that the mapped data points of the two classes are linearly separable in the feature space. 
Assuming there exists a kernel function $K$ such that $K(x, y) = \Phi(x) \cdot \Phi(y)$, a nonlinear 
SVM can be constructed by replacing the inner product $x_i \cdot x$ by the kernel function 
$K(x_i, x)$ in eqn. (2). This corresponds to constructing an optimal separating hyperplane 
in the feature space. Kernels commonly used include polynomials $K(x, y) = (x \cdot y)^p$, 
and the Gaussian Radial Basis Function (RBF) kernel $K(x, y) = \exp\{-\gamma \|x - y\|^2\}$. 
The Gaussian RBF has been found to perform better for histogram-like features [7,5]; 
thus, unless specified otherwise, this is the kernel we will use in the present paper. 
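
The kernelised classification function can be sketched in a few lines of Python. This is only an illustrative sketch and not the LibSVM implementation used in the paper: the function names, the toy support vectors, the multipliers and the value of gamma are all assumptions chosen for the example.

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def polynomial_kernel(x, y, degree):
    """Polynomial kernel K(x, y) = (x . y)^degree."""
    return sum(a * b for a, b in zip(x, y)) ** degree

def decision(x, support_vectors, labels, alphas, b, kernel):
    """Evaluate sign(sum_i alpha_i * y_i * K(x_i, x) + b) over the support vectors."""
    s = sum(a * y * kernel(sv, x)
            for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if s >= 0 else -1

# Toy values (hypothetical, for illustration only).
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
labels = [-1, +1]
alphas = [1.0, 1.0]

def kernel(u, v):
    return rbf_kernel(u, v, gamma=1.0)

print(decision((0.9, 0.8), support_vectors, labels, alphas, 0.0, kernel))
```

Only the support vectors enter the sum, which is why only they need to be stored in the model.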

The extension of SVM from 2-class to M-class problems can be achieved following 
two basic strategies: In a one-vs-others approach, M SVMs are trained, each separating a 
single class from all remaining classes. Although the most popular scheme for extending 
to multi-class problems (see for instance [8,5,7]), there is no bound on its generalisation 
error, and the training time of the standard method scales linearly with M [8]. In the 
second strategy, the pairwise approach, $M(M-1)/2$ two-class machines are trained. 
The pairwise classifiers are arranged in trees, where each tree node represents an SVM. 
Decisions can be made using a bottom-up tree similar to the elimination tree used in 
tennis tournaments [8], or a Directed Acyclic Graph (DAG, [25]). 
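
As a rough sketch of the pairwise strategy with a decision DAG, the evaluation can be written as a list elimination: the first and last remaining classes are compared and the loser is removed, so that M - 1 of the M(M-1)/2 machines are evaluated per query. The helper names and the toy one-dimensional "machines" below are hypothetical stand-ins for trained two-class SVMs.

```python
from itertools import combinations

def train_pairwise(classes, train_binary):
    """Train one two-class machine per unordered pair: M(M-1)/2 machines."""
    return {(a, b): train_binary(a, b) for a, b in combinations(classes, 2)}

def dag_classify(x, classes, machines):
    """Evaluate the decision DAG by elimination; each machine returns the winning label."""
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        key = (first, last) if (first, last) in machines else (last, first)
        winner = machines[key](x)
        if winner == first:
            remaining.pop()       # the last class is eliminated
        else:
            remaining.pop(0)      # the first class is eliminated
    return remaining[0]

# Toy stand-in for a trained SVM: pick the nearer of two class centres (assumption).
centres = {"cotton": 0.0, "linen": 1.0, "sponge": 5.0}

def train_binary(a, b):
    return lambda x: a if abs(x - centres[a]) <= abs(x - centres[b]) else b

classes = list(centres)
machines = train_pairwise(classes, train_binary)
print(len(machines), dag_classify(4.2, classes, machines))
```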
3.2 Results 

Platt and others [25] presented an analysis of the generalisation error for DAG, indicating 
that building large margin DAGs in a high dimensional feature space can yield good 
generalisation performance. On the basis of this result and of several empirical studies, 
we used a pairwise approach with DAG in this paper, using the LibSVM library [6]. $C$ 
was fixed at 100 whereas $\gamma$ in the RBF was obtained automatically by cross-validation. 
The histograms were treated as feature vectors and normalised to unit length. 

We compared the SVM classifier with our own implementation of the algorithm of 
Varma and Zisserman [31], which from now on will be denoted the VZ algorithm, and 
we use the same 200 x 200 pixels greyscale image patches as they do. The patches are 
selected such that only foreground is present. 

A first experiment ascertains the maximum performance that can be achieved on 
the CUReT database by using a very large texton vocabulary. 40 textons were found 
from each of the 61 materials, giving a total dictionary of 40 x 61 = 2440 textons. 
The 92 images per sample were split equally into training and test sets. Varma and 
Zisserman [32] previously reported a 97.43% success rate, while our own implementation 
of their algorithm gave an average of 97.66% with a standard deviation of 0.11% over 
10 runs.* In contrast, the SVM classifier gave 98.36 ± 0.10% using an RBF kernel and 
98.46 ± 0.09% using the kernel $K = \exp\{-\gamma \chi^2\}$. We implemented this Mercer 
kernel [3] within LibSVM. This performs better even than the very best result obtained 
in [32] using an MRF model (98.03%) which, to our knowledge, previously represented 
the best overall classification rate on the CUReT database. 
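
A kernel of this form can be sketched as follows. The paper does not spell out its definition of the chi-square statistic, so the symmetric form used here, summing (a - b)^2 / (a + b) over histogram bins and skipping empty bins, is an assumption, as are the function names.

```python
import math

def chi_square(h1, h2):
    """Chi-square distance between two histograms (assumed symmetric form)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def chi_square_kernel(h1, h2, gamma):
    """K(h1, h2) = exp(-gamma * chi_square(h1, h2))."""
    return math.exp(-gamma * chi_square(h1, h2))

print(chi_square_kernel([0.5, 0.3, 0.2], [0.4, 0.4, 0.2], gamma=1.0))
```

Identical histograms give a kernel value of 1, and the value decays towards 0 as the histograms diverge.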

Another natural extension to the Varma and Zisserman algorithm is to replace the 
Nearest Neighbour classifier with a k-Nearest Neighbour scheme. Several variants of 
k-NN were tried with different strategies to resolve conflicts [28]. Of these, Method 2 
from [28] proved best in our scenario, but no variant yielded an improved recognition 
rate for any choice of k > 1. This is probably due to a relatively sparse sampling of the 
pose and illumination conditions in the training set. 
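
For illustration, a k-Nearest Neighbour classifier over texton histograms might look as follows. This sketch does not reproduce Method 2 of [28]: the conflict-resolution rule (majority vote, with ties going to the class of the closest tied neighbour), the L1 distance and all names are assumptions made for the example.

```python
from collections import Counter

def l1_distance(h1, h2):
    """Sum of absolute bin differences (placeholder distance, an assumption)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def knn_classify(query, models, k=1, distance=l1_distance):
    """models is a list of (histogram, label) pairs; k=1 gives plain Nearest Neighbour."""
    ranked = sorted(models, key=lambda m: distance(query, m[0]))[:k]
    votes = Counter(label for _, label in ranked)
    top = max(votes.values())
    for _, label in ranked:       # ranked is ordered by increasing distance
        if votes[label] == top:
            return label

models = [([1.0, 0.0], "a"), ([0.9, 0.1], "a"),
          ([0.0, 1.0], "b"), ([0.2, 0.8], "b")]
print(knn_classify([0.8, 0.2], models, k=3))
```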

Further experiments examine the dependency on the size of the training set (Fig. 3a) 
and the texton vocabulary (Fig. 3b). Both plots clearly demonstrate that the SVM classifier 
reduces the error rate by 30-50% in comparison with the method of [31]. In 
both experiments, textons were found from the 20 materials specified in [16] rather than 
all 61 materials. In Fig. 3a, 10 textons per material are used, giving a dictionary of 
20 x 10 = 200 textons. In Fig. 3b, the training set consists of 23 images per material, 
and the remaining 69 images per material are placed in the test set. 



* The variability within experiments is due to slightly different texton vocabularies; images are 
selected at random when generating the dictionary with K-means clustering. The difference of 
0.23% between our results and the figure of 97.43% reported in [32] is caused by our use of 
more truncated filter kernels (41 x 41 compared to 49 x 49 [30]) although the scales used to 
compute the kernels were identical. For a texton to be assigned to a pixel, the entire support 
region of the filter kernel is required to lie inside the 200 x 200 image patch. Thus the texton 
histograms contain more entries when a smaller filter kernel is used. It may be noted that the 
MRF algorithm of [32] computes descriptors from significantly smaller regions, for instance 
7x7. 
Fig. 3. Experiments comparing our SVM scheme with the VZ [31] approach. (a) plots the reliance 
on the number of views in the training set, (b) the dependency on the size of the texton vocabulary, 
and (c) the size of the stored model. In (c) the model reduction schemes of [31,32] were not 
implemented. 



Training times for SVM vary from about 20 seconds (with a vocabulary of 100 
textons, 12 views per material in the training set) up to roughly 50 minutes (for 2440 
textons, 46 views per material). Finding $\gamma$ by cross-validation, if required, typically 
incurs a further cost of 3-7 times the figures reported above. 

The size of the resulting model is illustrated in Fig. 3c. Recalling that only the 
support vectors need be stored, and noting that storing the coefficients incurs little 
overhead, SVM reduces the size of the model by 10-20%. This is significantly less 
than the reduction by almost 80% obtained using the greedy algorithms described in 
[31] and [32]. However, the scheme in [31] used the test set for validating the model, 
which is unreasonable in a recognition task, while the method in [32] was extremely 
expensive in training, in fact by a few orders of magnitude [30] in comparison with the 
more expensive times listed for SVM above. Moreover, their procedure for selecting a 
validation set from the training set is largely heuristic and at a high risk of over-fitting, 
in which case the performance on the test set would drop very significantly [30]. 



4 Material Classification under Variations in Scale 



The results presented so far on the CUReT database were obtained without significant 
scale variation in the images.† In the real world, scale undoubtedly plays an important 
role, and it seems unlikely that the classifiers described so far will perform well. First, 
the individual filters are tuned to certain frequencies, and zooming in or out on a texture 
changes the characteristic frequencies of its visual appearance. Second, zooming in on a 
texture can make visible fine-level details which could not be recorded at coarser scales 
due to the finite resolution of the imaging device. Examples are given in Fig. 4. With 
cotton, for instance, at a coarse scale a vertical line structure is just about visible, whereas 
at a fine scale the woven grid can be seen clearly, including horizontal fibres. 

† Four samples are zoomed-in images of other materials. In the experiments reported in this paper, 
classifying one material as the zoomed-in version of that same material is labelled an incorrect 
match. In practice such confusions are fairly common for those four materials, but this does not 
have a very large effect on classification rates when averaged over all materials. 
Fig. 4. The appearance of materials can change dramatically with distance to the camera. 
Panels: (a) Cotton, (b) Sandpaper, (c) Sponge, each shown (i) distant and (ii) close. 
Fig. 5. Variations in scale can have a disastrous effect. In this experiment the training set contains 
images only at the default scale whereas the test set contains images rescaled by amounts up to a 
factor of two both up and down. For sandpaper (a) the recognition rate drops dramatically, whereas 
for sponge (b) it is more stable, probably since the salient features are repeated over a wide 
range of scales. Results averaged over the entire CUReT database are shown in (c). 



4.1 A Motivational Experiment 

Experimental confirmation of the scale-dependence of the texton-histogram based 
schemes was obtained by supplementing the CUReT database with artificially scaled 
versions of its images. Rather than rescaling the images, which raises various issues with 
respect to smoothing and aliasing, the filters were rescaled. For instance, reducing the 
size of the image (zooming out) by a factor of two is equivalent to doubling the standard 
deviations in the filters. This procedure was repeated at eight logarithmically spaced 
intervals per octave, scaling both up and down one octave. This resulted in 2 x 8 = 16 
scaled images in addition to the unscaled original, giving a total of 17 images. Only the 
unscaled images were placed in the training set, whereas recognition was attempted at 
all 17 scales.‡ The 92 images per sample were split evenly into training and test sets, 
and a texton vocabulary of 400 textons was used. 

Fig. 5 illustrates this dependency on scale for two materials. Sandpaper (Fig. 5a) 
shows almost no robustness to changes in scale, whereas sponge (Fig. 5b) is much more 
resilient. These effects can be attributed to two main factors. The first concerns intra-class 
properties: materials with a highly regular pattern have a clear characteristic scale, 
whereas others, such as sponge, exhibit similar features over a range of scales. The 
feature vector for the former material could be severely mutated, whereas we expect 
the descriptor of the latter to be more robust to changes in scale. The second factor 
depends on the inter-class variation in the database: the recognition rate depends on the 
degree of distraction caused by other materials. It is conceivable that a material imaged at 
a certain scale closely resembles another material at the default scale. Fig. 5c shows 
corresponding plots for an average over all 61 materials in the CUReT database. 

‡ We acknowledge that this method is no true replacement for real images since (i) it is not 
possible to increase the resolution while artificially zooming in, and (ii) the information content 
is reduced somewhat when artificially zooming out since the size of the 200 x 200 pixels patch 
is effectively reduced. 

Table 1. The recognition rate (in %) on the artificially rescaled CUReT database as the richness 
of the model is varied both with respect to the sampling density in the scale direction and in 
how many of the original 92 images are incorporated in the training set (per scale). With 3 scales 
present, the training set includes the original image and also samples at scales one octave up and 
one octave down. With five scales, half-octave positions are made available during training, and 
with 9 scales, quarter-octave positions are also used. 
(a) SVM (b) Varma and Zisserman [31] 
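
The scale sampling used in Section 4.1 and in Table 1 can be sketched as below. This is a minimal illustration and not the authors' code: the function names and the base filter standard deviations are hypothetical, and the only relation taken from the text is that shrinking the image by a factor of two corresponds to doubling the filter standard deviations.

```python
def log_spaced_scales(steps_per_octave, octaves_each_way=1):
    """Scale factors equidistant in log-scale, centred on the unscaled image (1.0)."""
    n = steps_per_octave * octaves_each_way
    return [2.0 ** (k / steps_per_octave) for k in range(-n, n + 1)]

def rescale_sigmas(sigmas, zoom):
    """Filter standard deviations that mimic rescaling the image by `zoom`.

    zoom = 0.5 (zooming out by a factor of two) doubles every sigma.
    """
    return [s / zoom for s in sigmas]

test_scales = log_spaced_scales(8)     # 17 scales: 16 rescaled plus the original
train_3 = log_spaced_scales(1)         # octave spacing
train_5 = log_spaced_scales(2)         # half-octave spacing
train_9 = log_spaced_scales(4)         # quarter-octave spacing

base_sigmas = [1.0, 2.0, 4.0]          # hypothetical filter bank scales
print(len(test_scales), rescale_sigmas(base_sigmas, 0.5))
```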

4.2 Robustness to Scale Variations: A Pure Learning Approach 

The experiment described above indicated that providing robustness to changes in image 
scale can be crucial if material recognition is to function in the real world. A natural 
strategy for providing such robustness is to extend the training set to cover not just varia- 
tions in pose and illumination conditions, but also scale. An alternative, left unexplored 
here, would be to include only images at one scale during training, but then artificially 
rescale the query image to a number of candidate scales by rescaling the filter bank. 

An open question is how densely it is necessary to sample in the scale direction, 
particularly since the size of the training set has obvious implications for algorithm speed 
and memory requirements. Clearly there will be some dependence on the bandwidth of 
the filters, but the amount of inter-class variation will also be of consequence. 

This dependence on sampling in the scale dimension was ascertained empirically 
on the rescaled CUReT database, and our findings are summarised in Tables 1a and 
1b for the SVM and VZ classifiers respectively with a vocabulary of 400 textons. The 
most noteworthy aspect of these results is that impoverishing the model in the scale 
dimension appears to have a more severe effect than reducing the size of the training 
set with respect to the proportion of the original 92 images which were placed in the 
training set. Both SVM and the VZ schemes exhibit such behaviour. A further point 
worth emphasising is that SVM systematically outperforms the VZ classifier, as was 
also seen in Section 3. Again, we attempted replacing the Nearest Neighbour classifier 
in the Varma and Zisserman approach with k-Nearest Neighbour schemes, but without 
observing any improvement for k > 1. 
(b) The variation of pose and illumination present in the KTH-TIPS database. 

Fig. 6. The variations contained in the new KTH-TIPS (Textures under varying Illumination Pose 
and Scale) database. In (a) the middle image, depicting the central scale, was selected to correspond 
roughly to the scale used in the CUReT database. The left and right images are captured with the 
sample at half and twice that distance, respectively. Three further images per octave (not shown) are 
present in the database. (b) shows 3 out of 9 images per scale, showing the variation of pose and 
illumination. Prior to use, images were cropped so only foreground was present. 



5 The KTH-TIPS Database of Materials under Varying Scale 

Although the results presented above gave some indication as to the deterioration in 
performance under changes in scale, the artificial rescaling is no perfect replacement for 
real images. Therefore we created a new database to supplement CUReT by providing 
variations in scale in addition to pose and illumination. Thus we named it the KTH-TIPS 
(Textures under varying Illumination Pose and Scale) database. A second objective with 
the database was to evaluate whether models trained on the CUReT database could be 
used to recognise materials from pictures taken in other settings. This could indeed prove 
challenging since not only the camera, poses and illuminant differ, but also the actual 
samples: can another sponge be recognised using the CUReT sponge? 

To date, our database contains ten materials also present in the CUReT database. 
These are sandpaper, crumpled aluminium foil, styrofoam, sponge, corduroy, linen, 
cotton, brown bread, orange peel and cracker B. These are imaged at nine distances from 
the camera to give equidistant log-scales over two octaves, as illustrated in Fig. 6a for 
the cracker. The central scale was selected, by visual inspection, to correspond roughly 
to the scale used in the CUReT database. At each distance images were captured using 
three different directions of illumination (front, side and top) and three different poses 
(central, 22.5° turned left, 22.5° turned right) giving a total of 3 x 3 = 9 images per 
scale, and 9 x 9 = 81 images per material. A subset of these is shown in Fig. 6b. For 
each image we selected a 200 x 200 pixels region to remove the background. 

The database is freely available on the web [12]. 

We now present three sets of experiments on the KTH-TIPS database, differing in 
how the model was obtained. The first uses the CUReT database for training, the second 
a combination of both databases, and the third only KTH-TIPS. 
Fig. 7. Experiments attempting to recognise images from the new KTH-TIPS database using a 
model trained on all 61 materials of the CUReT database. The recognition rate is plotted against 
scale for three materials: (a) Sandpaper, (b) Sponge, (c) Corduroy. 



Using the CUReT database for training. We attempted to recognise the materials 
in KTH-TIPS using a model obtained by training on the 61 materials of the CUReT 
database. 46 out of 92 images per material were placed in the training set. To cope 
with variations in scale, the procedure described in Section 4.2 is used: the model is 
acquired by rescaling each training sample from the CUReT database by adapting the 
Gaussian derivative filters. For this experiment the training set contained data from 9 
scales, equidistantly spaced along the log-scale dimension over two octaves. 

Results for sandpaper, sponge and corduroy can be seen in Fig. 7a, b and c respec- 
tively. Performance on sandpaper is very poor. This failure could be due to differences 
between our sample of sandpaper and the CUReT sample of sandpaper, despite our 
efforts to provide similar samples. We did, however, note that sandpaper was a very 
difficult material to recognise also in experiments using the CUReT database as the test 
set. This indicates that many of the other materials can be confused with sandpaper. 

Results were much improved for sponge and corduroy where recognition results of 
around 50% were achieved. It is interesting to note that the VZ classifier outperformed 
SVM in these experiments. The success rate of the VZ approach varies considerably with 
scale. It would seem that there is not perfect overlap between the two octaves in scale 
in the two datasets. Another explanation for a drop-off in performance at fine scales is 
that the rescaling of the CUReT database cannot improve the resolution: rescaling the 
filters does not permit sub-pixel structure to appear. A third reason is that the images 
closest to the camera were poorly focused in some cases. The SVM classifier provided 
much more consistent results over varying scales, as could perhaps be expected from 
the experiment reported in Table 1. However, the recognition rate was consistently fairly 
low over all scales. By supplying a test set too different from the samples provided during 
training, we are asking the SVM to perform a task for which it was not optimised; SVMs 
are designed for discrimination rather than generalisation. 

The recognition rates for all 10 materials, averaged over all scales, are provided in 
Table 2a. Results are, on the whole, well below 50%, clearly demonstrating that material 
recognition cannot be performed reliably in the real world merely using the CUReT 
database to form the model. We have, however, confirmed that many of the confusions 
are reasonable. For instance, cotton was frequently confused with linen. 
Table 2. Attempting to recognise samples from the KTH-TIPS database. Results are averaged 
over all scales. 

(a) Training only on CUReT 

                   Recognition rate (%) 
Material              SVM        VZ 
sandpaper            0.00      1.23 
aluminium foil      11.35     12.35 
styrofoam           34.72     38.27 
sponge              50.62     54.32 
corduroy            46.91     59.26 
linen               30.41     25.93 
cotton              11.11     20.99 
brown bread          5.11      7.41 
orange peel         11.11     11.11 
cracker B            3.70      7.41 
AVERAGE             20.50     23.83 

(b) Training on both CUReT and KTH-TIPS 

                   Recognition rate (%) 
Material              SVM        VZ 
sandpaper           77.78     66.67 
aluminium foil      91.67     88.89 
styrofoam          100.00     91.67 
sponge             100.00    100.00 
corduroy            80.56     80.56 
linen               61.11     41.67 
cotton              61.11     47.22 
brown bread         77.78     80.56 
orange peel        100.00     63.89 
cracker B           91.67     80.56 
AVERAGE             84.17     74.17 



Using a combination of databases for training. In a second experiment we 
combined the CUReT and KTH-TIPS databases for training. Thus we no longer needed 
to worry about training and tests being performed on different samples, but now some 
classes in the model contained a wider variety, thus increasing the risk of classes overlapping 
in the feature space. We report experimental results for training with 5 equidistant 
scales in the log-scale dimension, spanning two octaves. For KTH-TIPS materials, at 
each scale 3 out of 9 images in the KTH-TIPS database were used for training, as were 
43 images from the CUReT database. This same total number of 46 training images per 
scale was also used for the 51 materials only found in CUReT; these were included as 
distractors in the experiment. Results are summarised in Table 2b. As expected, including 
the KTH-TIPS samples in the training set yielded much better results; the average over 
all materials increased to 84.17% for SVM and 74.17% for VZ. 

Training on KTH-TIPS. We also performed similar experiments using only the 
KTH-TIPS database for training, implying that the model contained only 10 classes rather 
than 61. Thus there are fewer distractions, and the overall recognition rate increased to 
90.56% for SVM and 84.44% for VZ with 5 scales. Using only the central scale resulted 
in classification rates of 64.03% and 59.70% for SVM and VZ respectively. We will not 
report results from these experiments further. 



6 Discussion and Conclusions 

This paper attempted to bring material classification a step closer to real-world appli- 
cations by extending work on 3D textures under varying pose and illumination to also 
accommodate changes in scale. We showed in experiments that it is crucial to model 
scale in some manner, and we demonstrated a scale-robust classifier which incorporates 
the variations in scale directly into the training set. Experiments were conducted both 
on an artificially rescaled version of the CUReT database, and on a new database designed 
to supplement the CUReT database by imaging a subset (currently 10 out of 61) 
of the materials at a range of distances, while still maintaining some variation in pose 
and illumination. This database represents the second contribution of this paper, and is 
available to other researchers via the web [12]. 

A third contribution was to demonstrate the superiority of Support Vector Machines 
(SVMs) in this application. We obtained a recognition rate of 98.46% on the CUReT 
database at constant scale which, to our knowledge, represents the highest rate to date. 

However, a more sobering conclusion, and perhaps the most important message from 
this paper, is that such success on the CUReT database does not necessarily imply that it is 
possible to recognise those materials in the real world, even when scale is modelled. The 
main reason is probably that the samples imaged in our laboratory were not identical to 
those in CUReT. Naturally it is possible to include multiple samples of the same material 
in a database, but with increased intra-class variability, the risk of inter-class confusion 
increases. As this risk depends on the number of classes in the database, keeping this 
number low (e.g. in production line applications) should make it feasible to separate the 
classes, but with a large number it might only be possible to classify into broader groups 
of materials. The performance will again depend on scale since most materials appear 
more homogeneous with increased imaging distance. 

In other work we are currently investigating mechanisms for scale selection as a 
pre-processing step [18]. Although it might still be necessary to store models at multiple 
characteristic scales, this number should still be smaller than with the pure-learning 
approach. This would reduce storage requirements, and also the recognition time. 

A possible reason for sandpaper proving so hard to recognise in the experiments 
reported in Fig. 5a, is that the representation in terms of filters blurs the information 
too much with this kind of salt-and-pepper structure. Indeed, the role of filter banks has 
recently been questioned, and other representations have proved effective [11,32,19]. 
Thus we intend to explore such descriptors in our future work. 
Abstract. Segmentation of the left ventricle myocardium and chamber in 
gated SPECT images is a challenging problem. It is, however, the first step 
towards the geometry reconstruction and quantitative measurements 
needed to extract clinical parameters from the images. New algorithms 
for segmenting the left ventricle myocardium and chamber of the heart 
are proposed. The accuracy of the volumes measured from the geometrical 
models used for segmentation is evaluated using simulated images. 
The error on the computed ejection fraction is low enough for diagnosis 
assistance. Experiments on real images are shown. 



1 Introduction 

Accurate segmentation of the Left Ventricle (LV) myocardium in gated SPECT 
(Single Photon Emission Computed Tomography) images is a challenging problem 
due to the high level of noise and the signal drops resulting from insufficiently 
perfused regions. Automated segmentation of the LV chamber is even more difficult 
as the upper bound of the ventricle does not appear in the images. However, the 
accurate segmentation of the LV myocardium and chamber is very important 
for the estimation of the heart wall thickness and the chamber volume variation 
during the heart cycle. These parameters are needed to estimate clinically well 
established diagnosis indicators such as the ejection fraction. 

In this paper, we propose an implicit model-based algorithm for segmenting 
the LV myocardium and chamber. Our model is guided by the need for accurate 
quantitative estimation of volumes. Indeed, the coarse spatial and temporal 
resolution of gated SPECT images causes large partial volume effects that can 
significantly alter the volume estimation results. This paper follows an earlier 
study on levelset-based segmentation of gated SPECT images [4]. 

To model objects and segment images, both explicit [20] and implicit [16,2] 
deformable models have been proposed in the literature [15]. The levelset has 
been widely used in segmentation [17,3,5], medical image segmentation [13,18, 
8,11,14], heart segmentation [9] and SPECT image segmentation [7]. Some methods 
take shape priors into account [6]. In contrast to many earlier approaches, our 
algorithm takes into account a complete heart cycle sequence rather than 
processing volume frames independently. It is therefore better able to filter the 
image noise and to take into account temporal partial volume effects. 

2 Segmentation Model 

2.1 LV Myocardium Model 

The LV myocardium is modeled using a levelset-based method. The levelset 
provides a geometrical representation of the LV as well as a deformation process 
needed for extracting the myocardium shape from the image. In the levelset 
framework, a surface model S is implicitly represented as the 0 isosurface of a 
higher dimension function u. S deforms when u evolves according to an evolution 
equation. Most evolution criteria found in the literature are spatial [12]. In the 
case of dynamic sequences, we prefer the criterion of Debreuve et al. [7]: 

$$\frac{\partial u_n}{\partial t} = \left( \lambda_{in} (I_n - \mu_{in_n})^2 - \lambda_{out} (B - I_n)^2 + \lambda_c \kappa_n \right) \|\nabla u_n\| \qquad (1)$$

where $I_n$ represents the image at instant n. The whole sequence is used in order 
to filter noise and determine the mean background intensity B, reestimated at 
each iteration. $\kappa_n$ is the curvature at instant n and $\mu_{in_n}$ the mean of the 
internal part of image n, also reestimated at each iteration from the zero level of $u_n$. 
$\lambda_{in}$, $\lambda_{out}$, and $\lambda_c$ are weight parameters. This criterion makes the hypothesis that the 
image is composed of a uniform intensity region (the object to segment) and a 
background B. This approximation is only roughly valid for SPECT images due 
to the image noise, the inhomogeneity of the heart and the perfusion defects 
causing signal drops. A forward Euler scheme based on finite differences is used to 
solve the PDE. We note that there is no relation between iteration time (t) and 
physical time (n). 
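
To make the update concrete, one forward Euler step of criterion (1) might be sketched as follows. This is a simplified illustration and not the authors' implementation: it works on a single 2D frame instead of a 3D+T sequence, takes the background intensity as a given constant rather than reestimating it, assumes the convention that u is negative inside the object, and leaves boundary pixels untouched. All names are hypothetical.

```python
import math

def euler_step(u, image, background, lam_in, lam_out, lam_c, dt):
    """One forward Euler update u <- u + dt * speed * |grad u| on interior pixels."""
    rows, cols = len(u), len(u[0])
    # Mean intensity of the current interior (assumed convention: u < 0 inside).
    inside = [image[r][c] for r in range(rows) for c in range(cols) if u[r][c] < 0]
    mu_in = sum(inside) / len(inside) if inside else 0.0
    eps = 1e-8
    new_u = [row[:] for row in u]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central finite differences (c indexes x, r indexes y).
            ux = (u[r][c + 1] - u[r][c - 1]) / 2.0
            uy = (u[r + 1][c] - u[r - 1][c]) / 2.0
            uxx = u[r][c + 1] - 2.0 * u[r][c] + u[r][c - 1]
            uyy = u[r + 1][c] - 2.0 * u[r][c] + u[r - 1][c]
            uxy = (u[r + 1][c + 1] - u[r + 1][c - 1]
                   - u[r - 1][c + 1] + u[r - 1][c - 1]) / 4.0
            grad2 = ux * ux + uy * uy
            grad = math.sqrt(grad2)
            # Curvature of the level sets of u.
            kappa = (uxx * uy * uy - 2.0 * ux * uy * uxy + uyy * ux * ux) / (grad2 ** 1.5 + eps)
            speed = (lam_in * (image[r][c] - mu_in) ** 2
                     - lam_out * (background - image[r][c]) ** 2
                     + lam_c * kappa)
            new_u[r][c] = u[r][c] + dt * speed * grad
    return new_u
```

With this sign convention, a pixel that resembles the interior mean sees u decrease, so the zero level set grows to include it, while a background-like pixel sees u increase.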

2.2 LV Chamber Model 

The LV chamber surface is delimited by the inner boundaries of the myocardium and 
the valve plane on top. However, the LV myocardium has a U shape, open at the 
top, in gated SPECT images and the valves are not visible. A method to enclose 
the chamber volume is therefore needed. 



User Guided Methods. Some manual or semi-manual methods have been 
proposed in the literature. In [7], the authors manually set the location of the two 
planes. The result of this method is very user dependent. Faber, Cooke et al. [10] 
approximate the LV valves by two fixed planes (see left of figure 1). The location 
and orientation of the two planes were empirically fixed on a dataset and are 
only valid for a given acquisition protocol. Moreover, it is difficult to ensure 
that the chamber volume is always closed by the two planes: the upper part of 
the myocardium is irregular and holes are likely to appear between the myocardium 
boundaries and the planes. 



Fig. 1. Left: manual LV chamber closure. Center: membrane algorithm. Right: segmentation 
error inducing a volume estimation error. (Labels in the figure: muscle, cavity, membrane.) 



Membrane Algorithm. To overcome the difficulty of accurately closing the ventricle 
using planes, a new convex envelope algorithm was developed. This membrane 
method, depicted in the center of figure 1, is completely automatic. The membrane is 
a deformable surface initialized from the result of the myocardium segmentation 
and deformed using the following evolution equation: 

$$\frac{\partial u}{\partial t} = (\lambda_1 I + \lambda_2 \kappa) \|\nabla u\| \qquad (2)$$

where I is a binary image resulting from the myocardium segmentation. The 
drawback of the membrane method is its sensitivity to the correctness of the LV 
segmentation: for example, when the visible bright region is not U-shaped, the 
membrane encloses the outliers and the inside volume is poorly estimated, as 
illustrated on the right of figure 1. 

Once the membrane has been deformed, the LV chamber is obtained by 
binary image processing: a binary myocardium image is produced from the myocardium 
segmentation and the chamber is filled up to the membrane boundary. 
An isosurface of the resulting inner volume is computed as illustrated in figure 9. 
From the LV chamber volume, we can estimate the heart ejection fraction (EF). 
The EF is computed as the ratio of the volume of blood ejected at each 
heart beat (the difference between the chamber volumes at the end of the dilation 
phase, or diastole, and of the contraction phase, or systole) to the maximal chamber 
volume: $EF = (V_d - V_s)/V_d \times 100\%$ where $V_d$ and $V_s$ are the end of diastole and 
end of systole volumes, respectively. 
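
The EF computation is simple enough to sketch directly. The snippet below is only an illustration: the function names, the nested-list representation of the binary chamber image and the example volumes are assumptions, not values from the paper.

```python
def chamber_volume(mask, voxel_volume):
    """Volume of a binary 3D chamber mask (nested lists of 0/1) times the voxel volume."""
    return voxel_volume * sum(v for plane in mask for row in plane for v in row)

def ejection_fraction(v_diastole, v_systole):
    """EF = (Vd - Vs) / Vd * 100, in percent."""
    if v_diastole <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return (v_diastole - v_systole) / v_diastole * 100.0

# Hypothetical volumes in millilitres.
print(ejection_fraction(120.0, 48.0))
```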

2.3 Challenging the Homogeneous Intensity Region Hypothesis 

Criterion (1) used in this study is based on the hypothesis that the image is 
composed of a homogeneous object on a homogeneous background. This is 
only roughly true in real images due to partial volume effects and temporal 
blurring. 
Partial Volume Effects. The partial volume effect is responsible for myocardium 
intensity variations during the cardiac cycle: the myocardium appears brighter 
at end of systole and darker at end of diastole. When the thickness of the myocardium 
is only a few voxels (at the end of diastole), many voxels do 
not contain only myocardium but also part of the outside region, lowering their 
intensity. Conversely, at end of systole, the thickening of the muscle leads to 
brighter muscle voxels. This artefact is used as an index of wall thickening [1]. 
We can observe this phenomenon in figure 2. 




Fig. 2. Time frames 0 to 7, from left to right and top to bottom, showing one short 
axis slice. The intensity is higher in frames 2 and 3 (end of systole) than in the others. 



Temporal Blurring. Due to the image reconstruction process, blurring appears
in the image sequences. This temporal blurring is mostly visible at the base (top)
of the myocardium while the more static apex part is unaffected. The visual
consequences are that (i) during diastole, extremities of the ventricle muscle are
darker than the apex and (ii) during systole, borders of the muscle near the apex
are darker.

Consequences of Segmentation Errors on Volume Estimation. Partial volume
effects and temporal blurring combine, leading to different segmentation
errors during the systole and diastole phases. At end of diastole, the
myocardium extremities are darker and tend to be truncated. The temporal
blurring also causes the myocardium to appear slightly thicker than it is in
reality. This leads to underestimating the chamber volume. Conversely, at end
of systole, the myocardium extremities are overestimated while the myocardium
appears slightly thinner than it should be. This leads to overestimating the
chamber volume. The EF estimation is significantly affected by the combination
of these segmentation errors. Figure 4 shows the erroneous volumes estimated
by the segmentation algorithm for different weights of the internal force term
$\lambda_{in}$. Although they appear visually insignificant, these segmentation
errors have a drastic impact on the volume estimations. Due to the coarse
resolution of gated SPECT images and the small size of the LV, even an error of
only one voxel all along the LV chamber surface leads to an error of about 50%
of the chamber volume, making the computed EF meaningless.
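The order of magnitude of this one-voxel effect can be checked with a rough Python sketch. It assumes, purely for illustration, a spherical chamber whose radius is given in voxels; the sphere model, the function name and the five-voxel radius are assumptions and do not come from the study.

```python
def relative_volume_error(radius_voxels, error_voxels=1.0):
    """Relative volume loss when a sphere's surface is moved inward.

    radius_voxels: true radius of the (assumed spherical) chamber, in voxels.
    error_voxels:  inward boundary error all along the surface, in voxels.
    """
    if radius_voxels <= error_voxels:
        raise ValueError("the error must be smaller than the radius")
    shrunk = radius_voxels - error_voxels
    # Sphere volume scales with the cube of the radius.
    return 1.0 - (shrunk / radius_voxels) ** 3


# A hypothetical chamber of radius 5 voxels loses close to half its volume.
loss_for_radius_5 = relative_volume_error(5.0)
```

Under this simplified model, a chamber only a few voxels in radius loses roughly half of its volume for a single-voxel boundary error.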

3 Segmentation and Quantification Experiments 

To validate the accuracy of the algorithm, experiments were first performed on
simulated images. With simulated images, a ground truth (the actual volumes of
the virtual objects used for simulation) is known and the algorithm can be
quantitatively evaluated. Experiments were then conducted on real images, for
which no ground truth is available.

3.1 Experiments on Simulated Images 

Simulating Images Using the NCAT Phantom. W.P. Segars [19] has developed
a four-dimensional NURBS-based CArdiac-Torso (NCAT) phantom for
simulating nuclear medicine images. The organ models are based on non-uniform 
rational B-splines which define continuous surfaces. The phantom can thus be 
used at any spatial resolution. An important innovation is the extension of 
NURBS to a fourth dimension, time, to model the cardiac beat and the res- 
piratory motion. Given a model of the physics of the nuclear imaging process, 
simulated images of the numerical phantom can be computed by the NCAT 
simulator. The main advantage of using computerized organ models in medical 
studies is that the exact anatomy and physiological functions of the phantom 
are known, thus providing a gold standard against which the image processing 
and reconstruction algorithms can be evaluated quantitatively. 

Volume Estimation. Figure 3 shows an example of volume estimation after 
segmentation of an image produced by the NCAT simulator and extraction of 
the chamber by the membrane algorithm. The volume estimation error is small 
compared to the spatial resolution of the simulated images (less than 6% in the 
worst case) and the error on the computed EF is lower than 2%. The membrane 
algorithm therefore estimates a realistic closure of the LV boundary. 




Fig. 3. Comparison of the LV chamber volume estimation on simulated NCAT images
against the ground truth.

Realistic Images. Raw images produced by the NCAT simulator are not realistic
since they are noise-free and exhibit none of the temporal blurring described
in section 2.3. To further evaluate our algorithm, an artificial temporal
blurring was introduced by convolving the longitudinal sequences with a
Gaussian kernel in the time direction ($\sigma = 6$), and spatial Gaussian
noise ($\sigma = 4$) was added to each image. Direct segmentation of the blurred
and noisy images is not satisfactory. A high level of noise requires increasing
the internal force weight. However, this also makes the location of the
myocardium boundaries less precise.
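The degradation described above can be sketched in Python with the standard library only. The function names are hypothetical. Wrapping the time axis periodically (to mimic a cyclic cardiac sequence) and truncating the kernel at three standard deviations are assumptions. The units of the two $\sigma$ values are not stated in the text, so they are left as plain parameters.

```python
import math
import random


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2)
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_in_time(frames, sigma):
    """Convolve a sequence of frames with a Gaussian along the time axis.

    frames: list (over time) of flat lists of voxel intensities.
    The sequence is treated as periodic (assumption).
    """
    n_frames = len(frames)
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    blurred = []
    for t in range(n_frames):
        out = [0.0] * len(frames[t])
        for k, w in zip(range(-radius, radius + 1), kernel):
            source = frames[(t + k) % n_frames]
            for v, value in enumerate(source):
                out[v] += w * value
        blurred.append(out)
    return blurred


def add_gaussian_noise(frames, sigma, rng):
    """Add independent zero-mean Gaussian noise to every voxel."""
    return [[value + rng.gauss(0.0, sigma) for value in frame]
            for frame in frames]
```

A degraded sequence would then be obtained as `add_gaussian_noise(blur_in_time(frames, 6.0), 4.0, random.Random(0))`.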

Different internal weight values have been tested in criterion 1. Fixing
$\lambda_{out} = 1$ and $\lambda_c = 1$, figure 5 shows the segmentation
results for different values of $\lambda_{in}$. For low values of the internal
force weight ($\lambda_{in} = 1$ and $\lambda_{in} = 2$), the region near the
apex is poorly segmented: the myocardium surface is too thick. For a higher
value ($\lambda_{in} = 3$), the thickness is correct but a significant part of
the extremities is truncated. A small part of the extremities is also truncated
for $\lambda_{in} = 1$, due to the temporal blurring (for better visualization,
we superimposed the segmentation results on the original NCAT images but the
segmentation is computed on blurred and noisy images). Figure 4 shows that
the estimated volume of the LV chamber is indeed underestimated except at
the end of systole. Both the truncation of the myocardium extremities (for high
values of $\lambda_{in}$) and the overestimation of the myocardium thickness
(for low values of $\lambda_{in}$) lead to underestimating the chamber volume.




Fig. 4. LV chamber volume after segmentation of NCAT simulated data for different
values of $\lambda_{in}$ and ground truth.



3.2 A New Adaptive Algorithm 

Since the accurate segmentation of the different parts of the myocardium requires 
different tunings of the relative weights of the internal and external objects and 
no satisfying trade-off can be found for the complete image, we propose an 
adaptive algorithm described by the following steps: 

- Normal segmentation of the end of systole volume. The temporal
blurring is minimal and the myocardium intensity is maximal at end of
systole.

- LV barycenter and principal axis estimation. The principal axis computed
from the segmented image roughly corresponds to the heart long axis.

- Image volume splitting. The volume space is split by several short axis
planes. Different parameters can be attributed to each space region.

Fig. 5. Segmentation results for $\lambda_{in} = 1$ (top), $\lambda_{in} = 2$
(center), and $\lambda_{in} = 3$ (bottom).

Locating the Heart Long Axis. The end of systole volume presents the lowest
temporal blurring and the highest myocardium contrast. It is therefore the
easiest frame to segment. Moreover, this stage is not very sensitive to small
segmentation errors. The myocardium is first extracted in this frame. The
resulting model is used to produce a binary image. Only the largest connected
component of this image is kept, to remove outliers. The LV barycenter
$(\bar{x}, \bar{y})$ and the principal axis are estimated. The principal axis
is the eigenvector corresponding to the highest eigenvalue of the inertia matrix:

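$$I = \begin{pmatrix} m_{20} & m_{11} \\ m_{11} & m_{02} \end{pmatrix}, \qquad m_{pq} = \sum (x - \bar{x})^p (y - \bar{y})^q$$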
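A minimal Python sketch of this estimation, assuming a 2D point set taken from the binary image and an inertia matrix built from second-order central moments. The function name is hypothetical. The closed-form angle used here is the standard expression for the dominant eigenvector of a symmetric 2x2 matrix; a 3D volume would need a 3x3 matrix and a general eigen-solver instead.

```python
import math


def barycenter_and_axis(points):
    """Return the barycenter and the unit principal axis of 2D points.

    points: list of (x, y) coordinates of the object pixels.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Second-order central moments.
    m20 = sum((x - cx) ** 2 for x, _ in points)
    m02 = sum((y - cy) ** 2 for _, y in points)
    m11 = sum((x - cx) * (y - cy) for x, y in points)
    # Orientation of the eigenvector with the largest eigenvalue of
    # [[m20, m11], [m11, m02]].
    theta = 0.5 * math.atan2(2.0 * m11, m20 - m02)
    return (cx, cy), (math.cos(theta), math.sin(theta))
```

For points lying on the diagonal, for example `[(0, 0), (1, 1), (2, 2), (3, 3)]`, the returned axis points along the diagonal and the barycenter is the midpoint.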

Estimation of the Different Planes. The image volume is split by planes normal
to the heart long axis estimated in the previous step. The first volume region
contains the heart apex. This region is delimited by a plane orthogonal to the
principal axis and close enough to the LV barycenter to cut the myocardium
extremities segmented at the end of systole with $\lambda_{in} = 1$ (see
figure 6). The last region falls outside the ventricle, beyond the myocardium
extremities. It is delimited by a plane parallel to the first one, and outside
the segmentation obtained at the end of diastole with $\lambda_{in} = 3$. Two
other planes, equally spaced between the two previous ones, complete the split
of the image volume into 5 regions (see figure 6).

Fig. 6. Splitting the image volume by planes normal to the heart long axis.
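The splitting can be sketched in Python as follows. The sketch assumes that the long axis is a unit vector oriented from the apex towards the base, and that the first and last planes are given as signed offsets along this axis from the barycenter. The function names and these conventions are illustrative assumptions.

```python
def split_offsets(first_offset, last_offset, n_planes=4):
    """Offsets of n_planes equally spaced planes along the long axis."""
    step = (last_offset - first_offset) / (n_planes - 1)
    return [first_offset + k * step for k in range(n_planes)]


def region_index(point, barycenter, axis, offsets):
    """Index (0 = apex region, len(offsets) = beyond the base) of a point.

    point, barycenter: coordinates as tuples of equal length.
    axis: unit vector of the heart long axis, oriented from apex to base.
    offsets: increasing signed plane positions along the axis.
    """
    # Signed position of the point along the long axis.
    s = sum((p - b) * a for p, b, a in zip(point, barycenter, axis))
    # Count the planes that lie at or below the point.
    return sum(1 for offset in offsets if s >= offset)
```

With four planes the index ranges from 0 to 4, which matches the five regions of figure 6.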

Deformation with Variable Weights. Once the splitting planes have been located,
all frames of the cardiac sequence are segmented. Different values are set for the
$\lambda_{in}$ weight in each frame and each volume region. Criterion 1 is
modified into a new criterion with varying weights $\lambda_{in}$:

dtiji 

~ ^out{D In) || |[ (4) 

where $S_i$ is a 3D region delimited by planes (see figure 6): $S_0$ corresponds
to the region containing the apex, $S_4$ to the region beyond the base.

The different weights were determined empirically on an image dataset. Following
the observations made in section 3.1, we choose $\lambda_{in}$ weights that
grow from $S_0$ to $S_4$ near the diastole, and from $S_4$ to $S_0$ near the
systole. The weights used for NCAT simulated sequences are shown on the left of
figure 7.

Fig. 7. Adaptive weights. On the left, values for NCAT sequences. On the right,
values used for the real gated SPECT data.

Fig. 8. LV chamber volume variations along the cardiac cycle using NCAT data: real
values and computed values.



For other sequences, a different tuning of the weights might be needed. This
parameterization is constant for a given image acquisition protocol. Manual
tuning is only needed when changing the acquisition device or protocol.

Segmentation Results with the Adaptive Algorithm. Segmentation with the
adaptive algorithm leads to a good profile for the LV chamber volume variations
during the cardiac cycle, as shown in figure 8. The values are close to the real
ones, with an error of about 15 ml (about 13%). This is sufficient to compute
accurate EF values (with an error of 8%).



3.3 Experiments on Real Images from Healthy Patients 

Segmentation experiments were made on real images provided by the Centre
Antoine Lacassagne nuclear medicine department in Nice. The images were
acquired and filtered by a Butterworth low-pass filter before 3D reconstruction.
The voxel size is 3.46 x 3.46 x 7.12 mm.
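With this voxel size, a voxel count converts to a volume as in the short Python sketch below. The function name is hypothetical; only the voxel dimensions come from the text.

```python
# Voxel dimensions in millimetres, as reported for the real images.
VOXEL_SIZE_MM = (3.46, 3.46, 7.12)


def chamber_volume_ml(n_voxels, voxel_size_mm=VOXEL_SIZE_MM):
    """Convert a number of chamber voxels to a volume in millilitres."""
    dx, dy, dz = voxel_size_mm
    # 1 ml = 1000 cubic millimetres.
    return n_voxels * dx * dy * dz / 1000.0
```

One voxel is about 0.085 ml, so a volume error of 15 ml corresponds to roughly 176 voxels.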

Figure 9 shows a segmentation example. The myocardium segmentation (left), 
the convex envelope extracted by the membrane algorithm (center), and the LV 
chamber surface (right) are shown for 4 out of the 8 images of a complete se- 
quence. 

The weights used for the real data are shown on the right of figure 7. Figure 10
compares the evolution of the LV chamber volume obtained using basic
segmentation and the adaptive algorithm. With the latter method, the profile of
the curve is improved and the EF value (75.5% instead of 56.5%) is more realistic.

Fig. 9. LV chamber estimation: meshes (computed using the Marching Cubes algorithm)
of the myocardium segmentation (left), membrane with Gouraud shading (center), and
chamber estimation with Gouraud shading (right).

Fig. 10. Comparison of LV chamber volume variations during the cardiac cycle, on
real SPECT data, from a healthy patient.

4 Conclusion

The accurate estimation of the LV myocardium and chamber volumes is very
sensitive to segmentation errors in gated SPECT images. In this paper, we
proposed a novel adaptive algorithm taking into account the temporal nature of
the image sequences to more precisely locate the heart wall boundaries. Our
algorithm uses the whole image sequence to estimate the background intensity in
the deformation criterion 1. Furthermore, the temporal blurring of the sequences
is compensated through the spatial and temporal adaptation of the algorithm
parameters. A membrane algorithm was developed to automatically extract the
LV chamber from the myocardium segmentation.

We validated the accuracy of the method on simulated images. The error in the
LV chamber volume computation does not exceed 15 ml. The resulting variability
of the EF is about 8%, which is low enough for practical use. First results on
real images are encouraging, although a clinical study is needed to compare the
results to established gold standards. Setting the adaptive parameters
automatically is highly desirable from the user's point of view. Some work also
needs to be done for pathological images showing severe signal drops due to
myocardium perfusion defects: holes then appear in the myocardium wall that
need to be taken into account for volume estimation.

Abstract. We describe a Markov chain Monte Carlo based particle filter
that effectively deals with interacting targets, i.e., targets that are
influenced by the proximity and/or behavior of other targets. Such in- 
teractions cause problems for traditional approaches to the data associ- 
ation problem. In response, we developed a joint tracker that includes 
a more sophisticated motion model to maintain the identity of targets 
throughout an interaction, drastically reducing tracker failures. The pa- 
per presents two main contributions: (1) we show how a Markov random 
field (MRF) motion prior, built on the fly at each time step, can sub- 
stantially improve tracking when targets interact, and (2) we show how 
this can be done efficiently using Markov chain Monte Carlo (MCMC) 
sampling. We prove that incorporating an MRF to model interactions is 
equivalent to adding an additional interaction factor to the importance 
weights in a joint particle filter. Since a joint particle filter suffers from 
exponential complexity in the number of tracked targets, we replace the 
traditional importance sampling step in the particle filter with an MCMC 
sampling step. The resulting filter deals efficiently and effectively with 
complicated interactions when targets approach each other. We present 
both qualitative and quantitative results to substantiate the claims made 
in the paper, including a large scale experiment on a video-sequence of 
over 10,000 frames in length. 



1 Introduction 

This work is concerned with the problem of tracking multiple interacting targets. 
Our objective is to obtain a record of the trajectories of targets over time, and 
to maintain correct, unique identification of each target throughout. Tracking 
multiple identical targets becomes challenging when the targets pass close to 
one another or merge. 

The classical multi-target tracking literature approaches this problem by per- 
forming a data-association step after a detection step. Most notably, the multiple 
hypothesis tracker [1] and the joint probabilistic data association filter (JPDAF) 
[2] are influential algorithms in this class. These multi-target tracking algorithms 
have been used extensively in the context of computer vision. Some examples
are the use of nearest neighbor tracking in [3], the multiple hypothesis tracker
in [4], and the JPDAF in [5]. Recently, a particle filtering version of the JPDAF
has been proposed in [6].

In this paper we address the problem of interacting targets, which causes 
problems for traditional approaches. Dealing appropriately with this problem 
has important implications for vision-based tracking of animals, and is generally 
applicable to any situation where many interacting targets need to be tracked 
over time. Visual animal tracking is not an artificial task: it has countless appli- 
cations in biology and medicine. In addition, our long term research goals involve 
the analysis of multi-agent system behavior in general, with social insects as a 
model [7]. The domain offers many challenges that are quite different from the 
typical radar tracking domain in which most multi-target tracking algorithms 
are evaluated. 

In contrast to traditional methods, our approach relies on the use of a more 
capable motion model, one that is able to adequately describe target behavior 
throughout an interaction event. The basic assumption on which all established 
data-association methods rely is that targets maintain their behavior before and 
after the targets visually merge. However, consider the example in Figure 1, 
which shows 20 ants being tracked in a small arena. In this case, the targets do 
not behave independently: whenever one ant encounters another, some amount 
of interaction takes place, and the behavior of a given ant before and after an 
interaction can be quite different. The approach we propose is to have the motion 
model reflect this additional complexity of the target behavior. 

The first contribution of this paper is to show how a Markov random field
motion prior, built on the fly at each time step, can adequately model these
interactions and defeat these failure modes. Our approach is based on the well
known particle filter [8,9], a multi-hypothesis tracker that uses a set of weighted
particles to approximate a density function corresponding to the probability of
the location of the target given observations over time. The standard particle
filter weights particles based on a likelihood score, and then propagates these
weighted particles according to a motion model. Simply running multiple particle
filters, however, is not a viable option: whenever targets pass close to one another,
the target with the best likelihood score typically “hijacks” the filters of nearby
targets, as is illustrated in Figure 2. In these cases, identity could be maintained
during tracking by providing a more complex motion model that approximates
the interaction between targets. We show below that incorporating an MRF to
model interactions is equivalent to adding an additional interaction factor to the
importance weights in a joint particle filter.

Fig. 1. 20 ants are being tracked by an MCMC-based particle filter. Targets do not
behave independently: whenever one ant encounters another, some amount of
interaction takes place, and the behavior of a given ant before and after an
interaction can be quite different. This observation is generally applicable to any
situation where many interacting targets need to be tracked over time.

Fig. 2. Frames 9043 (a), 9080 (b) and 9083 (c). (a) Three interacting ants are being
tracked using independent particle filters. (b) The target with the best likelihood
score typically “hijacks” the filters of nearby targets. (c) Resulting tracker failure.
We address this problem using a Markov random field motion prior, built on the fly
at each time step, that can adequately model these interactions and defeat these
failure modes.

The second contribution is to show how this can be done efficiently using
Markov chain Monte Carlo (MCMC) sampling. The joint particle filter suffers
from exponential complexity in the number of tracked targets, n. Computational
requirements render the joint filter unusable for more than three or four
targets [10]. As a solution, we replace the traditional importance sampling step in
the particle filter with an MCMC sampling step. This approach has the appealing
property that the filter behaves as a set of individual particle filters when the
targets are not interacting, but efficiently deals with complicated interactions
when targets approach each other. The idea of using MCMC in the sequential
importance resampling (SIR) particle filter scheme has been explored before,
in [11]. Our approach can be considered a specialization of this work with an
MRF-based joint posterior and an efficient proposal step to achieve reasonable
performance.

In other related work, MCMC has been used in different ways in a particle
filter setting. [12,13] introduce periodic MCMC steps to diversify particles in a
fixed-lag smoothing scheme. Similarly, Marthi et al. [14] developed “Decayed
MCMC” sequential Monte Carlo, in which they focus the sampling activity of
the MCMC sampler on state variables in the recent past.

Finally, several other particle-filter based approaches exist for tracking
multiple identical targets. [15] “binds” particles to specific targets. [16] uses
partitioned sampling and a probabilistic exclusion principle, which adds a term to the
measurement model that assures that every feature measured belongs to only
one target. BraMBLe [17] addresses tracking and initializing multiple targets
in a variable-dimension framework. However, all of these are joint particle filter
approaches and are less suitable for tracking a large number of targets.



2 Bayesian Multi-target Tracking 

The multiple target tracking problem can be expressed as a Bayes filter. We
recursively update the posterior distribution $P(X_t|Z^t)$ over the joint state of
all $n$ targets $\{X_{it} \mid i \in 1..n\}$ given all observations
$Z^t = \{Z_1..Z_t\}$ up to and including time $t$, according to:

$$P(X_t|Z^t) = k\,P(Z_t|X_t) \int_{X_{t-1}} P(X_t|X_{t-1})\,P(X_{t-1}|Z^{t-1}) \qquad (1)$$

The likelihood $P(Z_t|X_t)$ expresses the measurement model, the probability that
we observe the measurement $Z_t$ given the state $X_t$ at time $t$. The motion
model $P(X_t|X_{t-1})$ predicts the state $X_t$ at time $t$ given the previous
state $X_{t-1}$. In all that follows we will assume that the likelihood
$P(Z_t|X_t)$ factors across targets as
$P(Z_t|X_t) = \prod_{i=1}^{n} P(Z_{it}|X_{it})$ and that the appearances of
targets are conditionally independent.
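To make recursion (1) concrete, here is a minimal Python sketch on a finite state space, where the integral becomes a sum. The discretization, the function name and the argument layout are illustrative assumptions and are not part of the tracker described in this paper.

```python
def bayes_filter_step(prior, transition, likelihood):
    """One update of a discrete Bayes filter.

    prior:      prior[j] = P(X_{t-1} = j | Z^{t-1}).
    transition: transition[j][i] = P(X_t = i | X_{t-1} = j) (motion model).
    likelihood: likelihood[i] = P(Z_t | X_t = i) (measurement model).
    Returns the posterior P(X_t = i | Z^t) as a list.
    """
    n_states = len(prior)
    # Prediction: sum over the previous state (the integral in (1)).
    predicted = [sum(transition[j][i] * prior[j] for j in range(n_states))
                 for i in range(n_states)]
    # Correction: multiply by the likelihood, then normalize (constant k).
    unnormalized = [likelihood[i] * predicted[i] for i in range(n_states)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("measurement has zero probability under the model")
    return [u / total for u in unnormalized]
```

For a static target (identity transition) and a uniform prior, the posterior is simply the normalized likelihood.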



2.1 Independent Particle Filters 



When identical targets do not interact, we can approximate the exact Bayes
filter by running multiple single-target particle filters. Mathematically, this is
equivalent to factoring the motion model $P(X_t|X_{t-1})$ as
$\prod_i P(X_{it}|X_{i,t-1})$. For each of the $n$ independent filters, we need
to approximate the posterior $P(X_{it}|Z^t)$ over each target's state $X_{it}$.
A particle filter can be viewed as an importance sampler for this posterior
$P(X_{it}|Z^t)$, using the predictive density on the state $X_{it}$ as the
proposal distribution. Briefly, one inductively assumes that the posterior at
the previous time step is approximated by a set of weighted particles

$$P(X_{i,t-1}|Z^{t-1}) \approx \{X_{i,t-1}^{(r)}, \pi_{i,t-1}^{(r)}\}_{r=1}^{N}$$

Then, for the current time-step, we draw $N$ samples $X_{it}^{(s)}$ from a
proposal distribution

$$X_{it}^{(s)} \sim q(X_{it}) = \sum_r \pi_{i,t-1}^{(r)} P(X_{it}|X_{i,t-1}^{(r)})$$

which is a mixture of motion models $P(X_{it}|X_{i,t-1}^{(r)})$. Then we weight
each sample so obtained by its likelihood given the measurement $Z_{it}$, i.e.

$$\pi_{it}^{(s)} = P(Z_{it}|X_{it}^{(s)})$$

This results in a weighted particle approximation
$\{X_{it}^{(s)}, \pi_{it}^{(s)}\}_{s=1}^{N}$ for the posterior $P(X_{it}|Z^t)$
over the target's state $X_{it}$ at time $t$. There are other ways to
explain the particle filter (see e.g. [18]) that more easily accommodate other
variants, but the mixture proposal view above is particularly suited for our
application.
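The mixture proposal view can be sketched in Python for a single target. The function and parameter names are hypothetical, and the motion and likelihood models are passed in as callables, so nothing here reflects the specific models used later in the paper. The weights are normalized to sum to one.

```python
import random


def particle_filter_step(particles, weights, sample_motion, likelihood, rng):
    """One step of a single-target particle filter (mixture proposal view).

    particles:     states X^(r) at time t-1.
    weights:       importance weights pi^(r) at time t-1.
    sample_motion: callable (state, rng) -> new state drawn from the
                   motion model.
    likelihood:    callable state -> P(Z_t | state).
    Returns the new particles and their normalized weights.
    """
    n = len(particles)
    # Draw from the mixture: pick component r with probability pi^(r) ...
    ancestors = rng.choices(particles, weights=weights, k=n)
    # ... then sample from the motion model of that component.
    new_particles = [sample_motion(x, rng) for x in ancestors]
    # Weight each sample by its likelihood given the measurement.
    new_weights = [likelihood(x) for x in new_particles]
    total = sum(new_weights)
    if total == 0.0:
        raise ValueError("all particles have zero likelihood")
    return new_particles, [w / total for w in new_weights]
```

Each target gets its own call to this step, which is why the cost of independent filters grows only linearly with the number of targets.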

While using independent filters is computationally tractable, the result is 
prone to frequent failures. Each particle filter samples in a small space, and the 
resulting “joint” filter’s complexity is linear in the number of targets, n. However, 
in cases where targets do interact, as in an insect tracking scenario, single particle 
filters are susceptible to failures exactly when interactions occur. In a typical 
failure mode, illustrated in Figure 2, several trackers will start tracking the 
single target with the highest likelihood score. 

3 MRF Motion Model 

Our approach to addressing tracker failures resulting from interactions is to in- 
troduce a more capable motion model, based on Markov random fields (MRFs). 
We model the interaction between targets using a graph-based MRF constructed 
on the fly for each individual time-step. An MRF is a graph (V, E) with undi- 
rected edges between nodes where the joint probability is factored as a product 
of local potential functions at each node, and interactions are defined on neigh- 
borhood cliques. See [19] for a thorough exposition. The most commonly used 
form is a pairwise MRF, where the cliques are pairs of nodes that are connected 
in the undirected graph. We assume the following pairwise MRF form, where
the $\psi(X_{it}, X_{jt})$ are pairwise interaction potentials:

$$P(X_t|X_{t-1}) \propto \prod_i P(X_{it}|X_{i,t-1}) \prod_{ij \in E} \psi(X_{it}, X_{jt}) \qquad (2)$$

The interaction potentials of the MRF afford us the possibility of easily spec- 
ifying domain knowledge governing the joint behavior of interacting targets. At 
the same time, the absence of an edge in the MRF encodes the domain knowl- 
edge that targets do not influence each other’s behavior. As a concrete example, 
in the insect tracking application we present in Section 6, we know that two
insects rarely occupy the same space. Taking advantage of this assumption can 
help greatly in tracking two targets that pass close to one another. An example 
MRF for our test domain is illustrated in Figure 3; in this case, targets within 
64 pixels (about 2 cm) of one another are linked by MRF edges. The absence of 
edges between two ants provides mathematical rigor to the intuition that ants 
far away will not influence each other’s motion. 
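Building the edge set at each time step can be sketched in Python as below. The function name is hypothetical and targets are reduced to their (x, y) positions in pixels; the 64 pixel threshold is the one quoted for the test domain.

```python
import math


def build_mrf_edges(positions, max_distance=64.0):
    """Return the undirected MRF edges between nearby targets.

    positions: list of (x, y) target positions in pixels.
    An edge (i, j), with i < j, links targets at most max_distance apart.
    """
    edges = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            if math.hypot(dx, dy) <= max_distance:
                edges.append((i, j))
    return edges
```

For example, targets at (0, 0) and (30, 40) are 50 pixels apart and are linked, while a third target at (200, 200) stays isolated.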

Since it is easier to specify the interaction potential in the log domain, we
express $\psi(X_{it}, X_{jt})$ by means of the Gibbs distribution:

$$\psi(X_{it}, X_{jt}) \propto \exp(-g(X_{it}, X_{jt})) \qquad (3)$$

where $g(X_{it}, X_{jt})$ is a penalty function. For example, in the ant tracking
application the penalty function $g(X_{it}, X_{jt})$ we use depends only on the
number of pixels of overlap between the target boxes of two targets. It is maximal
when two targets coincide and gradually falls off as targets move apart.

Fig. 3. To model interactions, we dynamically construct a Markov random field at
each time step, with edges for targets that are close to one another. An example is
shown here for 6 ants. Targets that are far from one another are not linked by an edge,
reflecting that there is no interaction.
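A potential of the form (3) with an overlap-based penalty can be sketched in Python as follows. For simplicity the sketch assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) and a penalty proportional to the overlap area. The box representation, the function names and the scale factor `gamma` are illustrative assumptions.

```python
import math


def overlap_area(box_a, box_b):
    """Area of the intersection of two axis-aligned boxes."""
    width = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    height = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    return max(0.0, width) * max(0.0, height)


def interaction_potential(box_a, box_b, gamma):
    """Gibbs potential exp(-g), with penalty g = gamma * overlap area."""
    return math.exp(-gamma * overlap_area(box_a, box_b))
```

Targets whose boxes do not overlap get a potential of exactly 1, so they leave the motion model unchanged. This behaves the same as having no edge between them.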



4 The Joint MRF Particle Filter 

The MRF terms that model interactions can be incorporated into the Bayes
filter in a straightforward manner, but now we are forced to consider the full
joint state of all $n$ targets. In particular, analogous to the single target filter
explained in Section 2.1, we recursively approximate the posterior on the joint
state $X_t$ as a set of $N$ weighted samples, obtaining the following Monte Carlo
approximation to the Bayes filter (1):

$$P(X_t|Z^t) \approx k\,P(Z_t|X_t) \sum_r \pi_{t-1}^{(r)} P(X_t|X_{t-1}^{(r)}) \qquad (4)$$

We can easily plug the MRF motion model (2) into the joint particle filter
equation (4). Note that the interaction potential (3) does not depend on the
previous target state $X_{t-1}$, and hence the target distribution (4) for the
joint MRF filter factors as

$$P(X_t|Z^t) \approx k\,P(Z_t|X_t) \prod_{ij \in E} \psi(X_{it}, X_{jt}) \sum_r \pi_{t-1}^{(r)} \prod_i P(X_{it}|X_{i,t-1}^{(r)}) \qquad (5)$$

In other words, the interaction term moves out of the mixture distribution. This
means that we can simply treat the interaction term as an additional factor
in the importance weight: we sample from the joint proposal distribution function

$$X_t^{(s)} \sim q(X_t) = \sum_r \pi_{t-1}^{(r)} \prod_i P(X_{it}|X_{i,t-1}^{(r)})$$

and weight the samples according to the following factored likelihood expression:

$$\pi_t^{(s)} = \prod_{i=1}^{n} P(Z_{it}|X_{it}^{(s)}) \prod_{ij \in E} \psi(X_{it}^{(s)}, X_{jt}^{(s)})$$

However, the joint particle filter approximation is not well suited for
multi-target tracking. Each particle contains the joint position of all $n$
targets, $X_t^{(s)} = \{X_{1t}^{(s)}, \ldots, X_{nt}^{(s)}\}$, and the filter
suffers from exponential complexity in the number of tracked targets, $n$. If
too few particles are used, all but a few importance weights will be near-zero.
In other words, the Monte Carlo approximation (4), while asymptotically
unbiased, will have high variance. These considerations render the joint filter
unusable in practice for more than three or four targets [10].

5 The MCMC-Based MRF Particle Filter 

The second contribution of this paper is to show that we can efficiently
sample from the factored target posterior distribution (5) using Markov chain
Monte Carlo (MCMC) sampling [20,21,22]. In effect, we are replacing the
inefficient importance sampling step with an efficient MCMC sampling step.

All MCMC methods work by generating a sequence of states, in our case
joint target configurations $X_t$ at time $t$, with the property that the
collection of generated states approximates a sample from the target
distribution (5). To accomplish this, a Markov chain is defined over the space
of configurations $X_t$ such that the stationary distribution of the chain is
exactly the target distribution. The Metropolis-Hastings (MH) algorithm [23] is
a way to simulate from such a chain. We use it to generate a sequence of samples
from $P(X_t|Z^t)$.



5.1 Proposal Density 

The key to the efficiency of this sampler rests in the specific proposal
density we use. In particular, we only change the state of one target at a time
by sampling directly from the factored motion model of the selected target:

$$Q(X'_t|X_t) = \frac{1}{n} \sum_i Q(X'_t|X_t, i) = \frac{1}{n} \sum_i \frac{1}{N} \sum_r P(X'_{it}|X_{i,t-1}^{(r)}) \prod_{j \neq i} \delta(X'_{jt} = X_{jt})$$

Each target is equally likely to be selected. The acceptance ratio for this
proposal can be calculated very efficiently, as only the likelihood and MRF
interaction potential for the chosen target need to be evaluated:

$$a_s = \min\left(1, \frac{P(Z_{it}|X'_{it}) \prod_{j \in E_i} \psi(X'_{it}, X'_{jt})}{P(Z_{it}|X_{it}) \prod_{j \in E_i} \psi(X_{it}, X_{jt})}\right)$$

where $E_i$ denotes the targets linked to target $i$ by an MRF edge. This also
has the desirable consequence that, if targets do not interact, the
MCMC-based filter above is just as efficient as multiple, independent particle
filters.

5.2 Algorithm Summary 

In summary, the detailed steps of the MCMC-based tracking algorithm we
propose are:

1. At time $t-1$ the state of the targets is represented by a set of samples
$\{X_{t-1}^{(r)}\}_{r=1}^{N}$, each containing the joint state
$X_{t-1}^{(r)} = \{X_{1,t-1}^{(r)}, \ldots, X_{n,t-1}^{(r)}\}$.

2. Initialize the MCMC sampler for time $t$ by drawing $X_t$ from the
interaction-free predictive density $\sum_r \prod_i P(X_{it}|X_{i,t-1}^{(r)})$.

3. Perform Metropolis-Hastings iterations to obtain $M$ samples from the
factored posterior (5). Discard the first $B$ samples to account for sampler
burn-in. In detail:

a) Proposal step:

i. Randomly select a joint sample $X_{t-1}^{(r)}$ from the set of unweighted
samples from the previous time step.

ii. Randomly select a target $i$ from the $n$ targets. This will be the target
that we propose to move.

iii. Using the previous state $X_{i,t-1}^{(r)}$ of this target, sample from the
conditionally dependent motion model $P(X_{it}|X_{i,t-1}^{(r)})$ to obtain
$X'_{it}$.

b) Compute the acceptance ratio:

$$a_s = \min\left(1, \frac{P(Z_{it}|X'_{it}) \prod_{j \in E_i} \psi(X'_{it}, X_{jt})}{P(Z_{it}|X_{it}) \prod_{j \in E_i} \psi(X_{it}, X_{jt})}\right)$$

c) If $a_s \geq 1$ then accept $X'_{it}$: set the $i$th target in $X_t$ to
$X'_{it}$. Otherwise, we accept it with probability $a_s$. If rejected, we leave
the $i$th target in $X_t$ unchanged. Add a copy of the current $X_t$ to the new
sample set.

4. The sample set $\{X_t^{(s)}\}_{s=1}^{M}$ at time $t$ represents an estimated
joint state of the targets.
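The steps above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: all names are hypothetical and the motion model, likelihood and potential are supplied as callables. Instead of an explicit edge set, the potential is evaluated against every other target, on the assumption that it returns 1 for non-interacting pairs. The initial state is drawn from a single randomly chosen previous sample.

```python
import random


def mcmc_filter_step(prev_samples, sample_motion, likelihood, potential,
                     n_samples, burn_in, rng):
    """One time step of an MCMC-based joint filter (sketch).

    prev_samples:  unweighted joint samples at t-1 (lists of target states).
    sample_motion: callable (state, rng) -> state from the motion model.
    likelihood:    callable (target index, state) -> P(Z_it | state).
    potential:     callable (state_a, state_b) -> pairwise potential psi,
                   assumed to be 1 for targets that do not interact.
    """
    n_targets = len(prev_samples[0])
    # Step 2: initialize from the interaction-free predictive density.
    start = rng.choice(prev_samples)
    current = [sample_motion(state, rng) for state in start]
    new_samples = []
    for iteration in range(burn_in + n_samples):
        # Step 3a: pick a previous joint sample and a target, then propose.
        previous = rng.choice(prev_samples)
        i = rng.randrange(n_targets)
        candidate = sample_motion(previous[i], rng)
        # Step 3b: the ratio only involves target i and its neighbors.
        numerator = likelihood(i, candidate)
        denominator = likelihood(i, current[i])
        for j in range(n_targets):
            if j != i:
                numerator *= potential(candidate, current[j])
                denominator *= potential(current[i], current[j])
        ratio = numerator / denominator if denominator > 0.0 else 1.0
        # Step 3c: accept or reject, then record the current joint state.
        if ratio >= 1.0 or rng.random() < ratio:
            current[i] = candidate
        if iteration >= burn_in:
            new_samples.append(list(current))
    return new_samples
```

Each iteration evaluates one likelihood pair and the potentials of a single target, so the cost per sample does not depend on the joint state dimension when targets are far apart.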

6 Experimental Validation 

We evaluated our approach by tracking through a very long video-sequence of
roaming ants, and present both quantitative results and a graphical comparison
of the different tracker methodologies. The test sequence consists of 10,400
frames (720 by 480 pixels, 24-bit RGB, 30 Hz) showing 20 ants roaming about an
arena. The ants themselves are about 1 cm long and move about the arena as
quickly as 3 cm per second. Interactions occur frequently and can involve 5 or
more ants in close proximity. In these cases, the motion of the animals is
difficult to predict. After pausing and touching one another, they often walk
rapidly sideways or even backward. This experimental domain provides a
substantial challenge to any multi-target tracker.



6.1 Experimental Details and Results 

We evaluated a number of different trackers with respect to a baseline “pseudo
ground truth” sequence. As no ground truth was available, we obtained the
baseline sequence by running a slow but accurate tracker and correcting any
mistakes it made by hand. In particular, we ran our MCMC tracker with 2000
samples, which lost track only 15 times in the entire sequence. When we observed
a tracker failure, we reinitialized the positions of the targets by hand and
resumed tracking.

Below are the specific implementation choices we made to specialize the
general algorithm of Section 5.2 to the ant tracking application:

— The state Xu of the ant i is its position {xit,yit) and orientation 9^. 

— For the likelihood model we used an appearance template approach with
a robust error norm. In particular, we use a 10 by 32 pixel rectangular
template containing a mean appearance image $\mu_F$ and a standard deviation
image $\sigma_F$, both estimated from 149 manually selected ant images. We also
learned a background mean image $\mu_B$ and standard deviation image $\sigma_B$
from 10,000 randomly selected pixels. The log-likelihood is then calculated
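as $\log P(Z_t \mid X_{it}) = -\tfrac{1}{2} \left\| (F(X_{it}) - \mu_F) / \sigma_F \right\|_1 + \tfrac{1}{2} \left\| (F(X_{it}) - \mu_B) / \sigma_B \right\|_1$. Here $F(X_{it})$ is the vector
of pixels from a target with state $X_{it}$ after translation and rotation to the
template coordinate frame.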

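— For the motion model we used a normal density centered on the previous
pose: $X_t \mid X_{t-1} = R(\theta_{t-1} + \Delta\theta)\,[\Delta x\ \Delta y\ 0]^\top + X_{t-1}$, where $[\Delta x, \Delta y, \Delta\theta] \sim
[\mathcal{N}(0, \sigma_x^2), \mathcal{N}(0, \sigma_y^2), \mathcal{N}(0, \sigma_\theta^2)]$ with $(\sigma_x, \sigma_y, \sigma_\theta) = (3.0, 5.0, 0.4)$.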

— For the MRF interaction terms we used a simple linear interaction function
$\gamma p$ where $p$ is the area of overlap between two targets and $\gamma = 5000$.

— MCMC parameters: we discard 25% of the samples to let the sampler burn 
in, regardless of the total number of samples. 
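The motion model and interaction term from the list above can be sketched as below. This is a hedged illustration only. The function names are hypothetical, the units (pixels and radians) are assumed, the new orientation is assumed to be the previous orientation plus the sampled angular noise, and the pairwise potential is assumed to take the Gibbs form exp(-γp), so that its logarithm is -γp.

```python
import math
import random

# Standard deviations and interaction weight quoted in the text.
SIGMAS = (3.0, 5.0, 0.4)
GAMMA = 5000.0


def sample_motion(prev_pose, rng, sigmas=SIGMAS):
    """Draw a pose (x, y, theta) from a normal density around prev_pose."""
    x, y, theta = prev_pose
    dx = rng.gauss(0.0, sigmas[0])
    dy = rng.gauss(0.0, sigmas[1])
    new_theta = theta + rng.gauss(0.0, sigmas[2])  # assumed orientation update
    # Rotate the body-frame displacement into image coordinates.
    c, s = math.cos(new_theta), math.sin(new_theta)
    return (x + c * dx - s * dy, y + s * dx + c * dy, new_theta)


def log_interaction(overlap_area, gamma=GAMMA):
    """Log pairwise potential, assuming psi = exp(-gamma * overlap_area)."""
    return -gamma * overlap_area
```

With γ = 5000, even one pixel of overlap lowers the log potential by 5000 under this assumed form, so proposals that place two targets on top of each other would almost never be accepted.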

Table 1 shows the number of tracking failures for all the tracker/sample size 
combinations we evaluated. We automatically identified failures of these track- 
ers when the reported position of a target deviated 50 pixels from the pseudo 
ground truth position. This allowed us to detect switched and lost targets with- 
out manual intervention. 
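The automatic failure check can be illustrated with a short sketch. The data layout (one list of (x, y) positions per frame) and the function name are assumptions, and so is treating a deviation of exactly 50 pixels as a failure. The sketch flags every offending frame and does not model the reinitialization that follows a failure.

```python
import math


def find_failures(tracked, truth, threshold=50.0):
    """Return (frame, target) pairs whose position strays from ground truth."""
    failures = []
    for frame, (est, ref) in enumerate(zip(tracked, truth)):
        for target, ((ex, ey), (rx, ry)) in enumerate(zip(est, ref)):
            # Euclidean distance in pixels between report and ground truth.
            if math.hypot(ex - rx, ey - ry) >= threshold:
                failures.append((frame, target))
    return failures
```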

Figure 4 shows the result graphically, comparing 3 different samplers, each 
with an equivalent sample size of 1000. For each of the trackers, we show exactly 
where failures occur throughout the sequences by tick-marks. To obtain a mea- 
sure of trajectory quality, we also recorded for each frame the average distance 



Table 1. Tracker failures observed in the 10,400 frame test sequence

Tracker                  Number of Samples   Number of Failures
MCMC                     50                  123
MCMC                     100                 49
MCMC                     200                 28
MCMC                     1000                16
single particle filter   10 per target       148
single particle filter   50 per target       125
single particle filter   100 per target      119
joint particle filter    50                  544
joint particle filter    100                 519
joint particle filter    200                 479
joint particle filter    1000                392

Fig. 4. (a-c): Qualitative comparison of 3 trackers, each tracking 20 ants using an equivalent
sample size of 1000. Tick marks show when tracking failures occur throughout
the sequence. The time series plot shows average distance from ground truth (averaged 
per target and per second) 



of the targets to their ground truth trajectories. This is shown in the figure as a 
time series, for each tracker, averaged per second time unit. 
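The per-second trajectory quality measure can be sketched as follows, assuming the same hypothetical per-frame lists of (x, y) positions and a frame rate of 30 Hz as stated for the test sequence. The function name is illustrative.

```python
import math


def mean_error_per_second(tracked, truth, fps=30):
    """Average target-to-ground-truth distance for each block of fps frames."""
    series = []
    for start in range(0, len(tracked), fps):
        total, count = 0.0, 0
        block = zip(tracked[start:start + fps], truth[start:start + fps])
        for est, ref in block:
            for (ex, ey), (rx, ry) in zip(est, ref):
                total += math.hypot(ex - rx, ey - ry)
                count += 1
        # Guard against empty blocks rather than dividing by zero.
        series.append(total / count if count else 0.0)
    return series
```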

6.2 Discussion 

From the quantitative results in Table 1 and the qualitative comparison in Fig- 
ure 4 we draw the following conclusions: 

1. The joint filter is clearly unusable for tracking this many targets. The track 
quality is very low and number of errors reported is very high. 

2. The MCMC-based trackers perform significantly better than independent 
particle filters with a comparable number of samples, both in track quality 
and failures reported. For example, the MCMC tracker with 1000 samples
had only 16 failures, as compared to 125 for 20 independent particle filters
with 50 particles each. 

3. To our surprise, an MCMC-based tracker with only 50 samples total per- 
formed as well as or better than 20 independent particle filters with 50 
samples each (1000 samples total). 

4. The MCMC-based trackers rapidly improve their performance as we increase 
the number of samples. The number of failures falls from 123 to 16 as the 
number of samples is increased from 50 to 1000. Such an effect is not seen for 
(a) frame 3054 (b) frame 3054 (c) frame 3072



Fig. 5. Typical failure modes of MCMC-based MRF particle filter occur when the assumption
that targets do not overlap is violated. (a) Two targets undergoing extensive
overlap, (b) The tracker reports the incorrect position for the overlapped ant. (c) The 
resulting tracker failure. 



an equivalent increase in computation for the single particle filters, because 
in that case increasing the number of samples does not improve the ability 
to deal with ant interactions. 

7 Conclusions 

In conclusion, the MCMC-MRF approach proposed in the paper has significantly 
improved the tracking of multiple interacting targets. Figure 5 shows that for
the insect tracking case, the few tracking failures that remain for the
MCMC-based tracker occur when our assumption that targets do not overlap is
violated. In these cases, it is unclear whether any data-association method offers a
solution. A more complicated joint likelihood model might be helpful in these 
cases. 

In future work, we intend to validate the approach proposed here by tracking 
hundreds of interacting targets. Our long term research goals involve the analysis 
of multi-agent system behavior in general, with social insects as a model [7]. In 
particular, we are looking at honey bees in an active bee hive as a challenging 
test for multi-target tracking. Finally, it is our hope that the MRF-based motion 
model and its efficient sequential implementation using MCMC will benefit other 
application domains besides vision-based animal tracking, for which it has clearly 
been shown to be useful. 
Abstract. A model of human appearance is presented for efficient pose
estimation from real-world images. In common with related approaches, a 
high-level model defines a space of configurations which can be associated 
with image measurements and thus scored. A search is performed to 
identify good configuration(s). Such an approach is challenging because 
the configuration space is high dimensional, the search is global, and the 
appearance of humans in images is complex due to background clutter, 
shape uncertainty and texture. 

The system presented here is novel in several respects. The formulation 
allows differing numbers of parts to be parameterised and allows poses 
of differing dimensionality to be compared in a principled manner based 
upon learnt likelihood ratios. In contrast with current approaches, this 
allows a part based search in the presence of self occlusion. Furthermore, 
it provides a principled automatic approach to other object occlusion. 
View-based probabilistic models of body part shapes are learnt that represent
intra- and inter-person variability (in contrast to rigid geometric
primitives). The probabilistic region for each part is transformed into the 
image using the configuration hypothesis and used to collect two appear- 
ance distributions for the part’s foreground and adjacent background. 
Likelihood ratios for single parts are learnt from the dissimilarity of the 
foreground and adjacent background appearance distributions. It is im- 
portant to note the distinction between this technique and restrictive 
foreground/background specific modelling. It is demonstrated that this 
likelihood allows better discrimination of body parts in real world images 
than contour to edge matching techniques. Furthermore, the likelihood 
is less sparse and noisy, making coarse sampling and local search more 
effective. A likelihood ratio for body part pairs with similar appearances 
is also learnt. Together with a model of inter-part distances this better 
describes correct higher dimensional configurations. Results from apply- 
ing an optimization scheme to the likelihood model for challenging real 
world images are presented. 
1 Introduction

It is popular in the literature to match a high-level shape model to an image in 
order to recover human pose (see the review papers [1,2]). Samples are drawn 
from the shape configuration space to search for a good match. The success of 
this approach, in terms of its accuracy and efficiency, depends critically on the 
choice of likelihood formulation and its implicit assumptions. This paper presents 
a strong likelihood model and a flexible, effectively low dimensional formulation 
that allows efficient inference of detailed pose from real-world images. Pose es- 
timation is performed here from single colour images so no motion information 
is available. This method could however form an important component in an 
automatically (re)initialising human tracker.

1.1 Assumptions 

Estimation of human body pose from poorly constrained scenes is made difficult 
by the large variation in human appearance. The system presented here aims 
to recover the variation due to body pose automatically and efficiently in the 
presence of other variations due to: 

— unknown subject identity, clothing colour and texture 

— unknown, significantly cluttered, indoor or outdoor scenes 

— uncontrolled illumination 

— general, other object occlusion 

It is assumed that perspective effects are weak and that the scale is such 
that distributions of pixel values or local features can be estimated and used 
to characterise body parts. It is further assumed that the class of viewpoint is
known, in this case a side-on view. These assumptions apply to a large proportion
of real world photographs of people. 



1.2 Formulation 

There are two main approaches to human pose estimation. The ‘top-down’ ap- 
proach makes samples in a high dimensional space and fully models self-occlusion 
(e.g. [3,4,5,6]). It does not incorporate bottom-up part identification and is inappropriate
without a strong pose prior (and is therefore mostly used in trackers).
The ‘bottom-up’ approach identifies the body parts and then assembles them 
into the best configuration. Whilst it does sample globally it does not model 
self-occlusion. Both approaches tend to rely on a fixed number of parts be- 
ing parameterised (a notable exception being the recent work of Ramanan and 
Forsyth [7]). However, occlusion by other objects or weak evidence may make 
some parts unidentifiable. The approach of partial configurations presented here 
bridges these two approaches by allowing configurations of different dimensional- 
ities to be compared. This is done by combining learnt likelihood ratios computed 
only from the parameterised, visible parts. The method has several advantages. 




Human Pose Estimation 






Firstly, it allows general occlusion conditions to be handled. Secondly, it makes 
use of the fact that some parts might be found more easily than others. For 
example, it is often easier to locate parts that do not overlap. Thirdly, it makes 
use of the fact that configurations with small numbers of parts contain much of 
the overall pose information because of inter-part linking. For example, knowing 
the position of just the head and outer limbs greatly constrains the overall pose. 
The approach of partial configurations, along with a global stochastic optimiza- 
tion scheme, is more flexible than pictorial structures [8] since it allows a large 
range of occlusion conditions. When employed in a time-constrained optimiza- 
tion scheme, it allows the system to report lower dimensional solution(s) should 
a higher dimensional one not be found in time. A consequence of the formulation 
is that parts must be parameterised in their own co-ordinate system rather than 
hierarchically as is often the case in tracking systems, e.g. [3]. Whilst this might 
appear to increase the dimensionality of the pose parameter space, in practice 
an offset term is often required to model complex joints like the shoulder [6] 
making the difference one of mathematical convenience. 

1.3 Outline 

The remainder of the paper details the three components that make up the 
likelihood ratio used to find humans in real images. For ease of exposition, Section 2
begins by describing the likelihood ratio used to find single body parts.
A probabilistic region template is transformed into image space and used to 
estimate foreground and adjacent background appearances. The hypothesised 
foreground and background appearances are compared and a likelihood ratio is 
computed, based upon learnt PDFs of the similarity for on-part responses and 
off-part responses. The performance of this technique is then demonstrated and 
compared to a competing method. Section 3.2 presents a method for compar- 
ing hypothesised pose configurations incorporating inter-part joint constraints 
in which subsets of the body parts are instantiated. Section 3.3 then introduces 
a constraint based on the a priori expectation that pairs of parts will have sim- 
ilar appearance. Finally, pose estimation results are presented and conclusions 
drawn. 

2 Finding Single Parts Using Probabilistic Regions 

The model of body parts proposed here provides an efficient mechanism for 
the evaluation of hypothesised body parts in everyday scenes due to a highly 
discriminatory response and characteristics that support efficient sampling and 
search. This Section describes the method used for modelling body part shape 
and the use of image measurements to score part hypotheses. It concludes with 
an investigation of the resulting response. 

2.1 Modelling Shape 

Current systems often use 2D or 3D geometric primitives such as ellipses, rectan- 
gles, cylinders and tapered superquadrics to represent body parts (e.g. [3,4,5]). 
These are convenient but rather ad hoc approximations. Instead, probabilistic region
templates are used here as body part primitives. Due to the limited presence
of perspective effects and 3D shape variation, a 2D model with depth ordering is 
used to represent the body. A variation of the scaled prismatic model [9] is used 
to parameterise the transformed appearance. This reduces the dimensionality 
compared to a 3D model and removes kinematic singularities [10]. 

A body part, labelled here by $i$, is represented using a single probabilistic
region template, $M_i$, which represents the uncertainty in the part’s shape
without attempting to enable shape instances to be accurately reconstructed.¹
This is particularly important for efficient sampling when the subject wears
loose-fitting clothing. The probability that an image pixel at position $(x, y)$ belongs
to a hypothesised part $i$ is then given by $M_i(T_i(x, y))$, where $T_i$ is a linear
transformation from image coordinates to template coordinates determined by the
part’s centre, $(x_c, y_c)$, image plane rotation, $\theta$, elongation, $e$, and scale, $s$. The
elongation parameter alters the aspect ratio of the template and is used to ap- 
proximate rotation in depth about one of the part’s axes. The probabilities in 
the template are estimated from example shapes in the form of binary masks 
obtained by manual segmentation of training images in which the elongation is 
maximal (i.e. in which the major axis of the part is parallel to the image plane). 
These training examples are aligned by specifying their centres, orientations and 
scales. Un-parameterised pose variations are marginalised over, allowing a re- 
duction in the size of the state space. Specifically, rotation about each limb’s 
major axis is marginalised since these rotations are difficult to observe. The 
templates are also constrained to be symmetric about this axis. It has been 
found, due to the insensitivity of the likelihood model described below to precise 
contour location, that upper and lower arm and leg parts can reasonably be rep- 
resented using a single template. This greatly improves the sampling efficiency. 
Some learnt probabilistic region templates are shown in Fig. 1. The uncertain 
regions in these templates arise because of (i) 3D shape variation due to change 
of clothing and identity, (ii) rotation in depth about the major axis, and (iii) 
inaccuracies in the alignment and manual segmentation of the training images. 
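The template estimation described above can be sketched as averaging aligned binary masks and then enforcing symmetry. This is an illustration under stated assumptions: masks are taken to be already aligned, equally sized nested lists of 0/1 values, the major axis is assumed to run down the centre column, and the function name is hypothetical.

```python
def learn_region_template(masks):
    """Average aligned binary masks, then mirror-average about the centre column."""
    rows, cols = len(masks[0]), len(masks[0][0])
    n = len(masks)
    # Per-pixel fraction of training masks in which the pixel is on the part.
    mean = [[sum(m[r][c] for m in masks) / n for c in range(cols)]
            for r in range(rows)]
    # Constrain the template to be symmetric about the assumed major axis.
    return [[(row[c] + row[cols - 1 - c]) / 2 for c in range(cols)]
            for row in mean]
```

Pixels on which the training masks disagree end up with intermediate probabilities, which is how the uncertain regions in the templates arise.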



2.2 Single Part Likelihood 

Several methods for body part detection have been proposed although in the 
opinion of the authors much work remains to be done. Matching geometric prim- 
itives to an edge field is popular, e.g. [11]. Wachter and Nagel [3] used only the 
edges that did not overlap with other parts. Sidenbladh et al. [12] emphasised 
learning the distribution of foreground and background filter responses (edge, 
ridge and motion) rather than forming ad hoc models. Ronfard et al. [8] learned 
part detectors from Gaussian derivative filters. Another popular method is mod- 
elling the background, but this has the obvious limitation of requiring knowledge 
of the empty scene. Matching model boundaries to local image gradients often 

¹ Note that while it would be possible to represent the body parts using a set of basis
regions, the mean was found to be sufficient here. 
Fig. 1. Head, torso and limb probabilistic region templates. The upper and lower arm
and legs are represented using a single mask (increasing sampling efficiency). Notice 
the masks’ symmetries. 



results in poor discrimination. Furthermore, edge responses provide a relatively 
sparse cue which necessitates dense sampling. In order to achieve accurate results 
in real world scenes the authors believe that a description that takes account of 
colour or texture is necessary. To accomplish this the high-level shape model can 
be used earlier in the inference process. One might envisage learning a model 
that described the wide variation in the foreground appearance of body parts 
present in a population of differently clothed people. Such a model would seek 
to capture regularities due to the patterns typically used in clothing. However, 
such an approach would require a high dimensional model and prohibitively large 
amounts of training data. Furthermore, it would not be strongly discriminatory 
because most clothing and image regions are uniformly textured. 




Fig. 2. The flow of data: A lower leg body part probabilistic region template is trans- 
formed into the image. The spatial extent of the template is such that the areas (in the 
probabilistic sense) of the foreground and background regions are approximately equal. 
The probabilistic region is used to estimate the foreground appearance and adjacent 
background appearance histograms. A likelihood is learnt based upon the divergence 
of the two histograms. 
The approach taken here is to use the dissimilarity between the appearance
of the foreground and background of a transformed probabilistic region as
illustrated in Fig. 2. These appearances will be dissimilar as long as a part is
not completely camouflaged. The appearances are represented here as PDFs of
intensity and chromaticity image features, resulting in 3D distributions. In
general, local filter responses could also be used to represent the appearance (cf.
[13]). Since texture can often result in multi-modal distributions, each PDF is
encoded as a histogram (marginalised over position). For scenes in which the
body parts appear small, semi-parametric density estimation methods such as
Gaussian mixture models would be more appropriate. The foreground appearance
histogram for part $i$, denoted here by $F_i$, is formed by adding image features
from the part’s supporting region proportional to $M_i(T_i(x, y))$. Similarly, the
adjacent background appearance distribution, $B_i$, is estimated by adding features
proportional to $1 - M_i(T_i(x, y))$.
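The soft assignment of features to the two histograms can be sketched as follows. It assumes that each pixel's feature has already been quantised to a single bin index and that the template probability for that pixel is available. The function and argument names are hypothetical.

```python
def appearance_histograms(bin_indices, part_probs, n_bins):
    """Build normalised foreground and adjacent-background histograms.

    bin_indices -- quantised feature bin for each pixel in the support region
    part_probs  -- template probability M_i(T_i(x, y)) for the same pixels
    """
    fg = [0.0] * n_bins
    bg = [0.0] * n_bins
    for b, m in zip(bin_indices, part_probs):
        fg[b] += m          # weight towards the foreground
        bg[b] += 1.0 - m    # remaining weight goes to the background

    def normalise(h):
        total = sum(h)
        return [v / total for v in h] if total > 0 else h

    return normalise(fg), normalise(bg)
```

Each pixel therefore contributes to both histograms, with the split governed by how confidently the template places it on the part.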

It is expected that the foreground appearance will be less similar to the
background appearance for configurations that are correct (denoted by $on$) than
incorrect (denoted by $\overline{on}$). Therefore, a PDF of the Bhattacharyya measure given
by Equation (1) is learnt for $on$ and $\overline{on}$ configurations [14]. The $on$ distribution
was estimated from data obtained by manually specifying the transformation
parameters to align the probabilistic region template to be on parts that are
neither occluded nor overlapping. The $\overline{on}$ distribution was estimated by generating
random alignments elsewhere in 100 images of outdoor and indoor scenes.
The $on$ PDF can be adequately represented by a Gaussian (although in fact the
distribution is skewed). Equation (2) defines $\mathrm{SINGLE}_i$ as the ratio of these two
distributions. This is the response used to score a single body part configuration
and is plotted in Fig. 3.



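$$I(F_i, B_i) = \sum_f \sqrt{F_i(f) \times B_i(f)} \qquad (1)$$

$$\mathrm{SINGLE}_i = \frac{p(I(F_i, B_i) \mid on)}{p(I(F_i, B_i) \mid \overline{on})} \qquad (2)$$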
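A minimal sketch of Equations (1) and (2) follows. The two learnt PDFs are passed in as callables because their learnt parameters are not given in the text. `gaussian_pdf` is included only as one possible representation, and all names here are illustrative assumptions.

```python
import math


def bhattacharyya(p, q):
    """Similarity of two normalised histograms: 1 if identical, 0 if disjoint."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, one possible model for a learnt PDF."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def single_ratio(similarity, p_on, p_off):
    """Likelihood ratio of the on-part PDF to the off-part PDF."""
    return p_on(similarity) / p_off(similarity)
```

A ratio above 1 indicates that the observed foreground-to-background similarity is more typical of a correctly placed part than of a random alignment.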



2.3 Enhancing Discrimination Using Adjoining Regions 

When detecting single body parts, the performance can be improved by dis- 
tinguishing positions where the background appearance is most likely to differ 
from the foreground appearance. For example, due to the structure of clothing, 
when detecting an upper arm, adjoining background areas around the shoul- 
der joint are often similar to the foreground appearance (as determined by the 
structural model used here to gather appearance data). The histogram model 
proposed thus far, which marginalises appearance over position, does not use 
this information optimally. To enhance discrimination, two separate adjacent 
background histograms are constructed, one for adjoining regions and another 
for non-adjoining regions. It is expected that the non-adjoining region appear- 
ance will be less similar to the foreground appearance than the adjoining region 
Fig. 3. Left: A plot of the learnt PDFs of foreground to background appearance similarity
for the $on$ and $\overline{on}$ part configurations of a head template. Right: The log of the
resulting likelihood ratio. It can be seen that the distributions are well separated.



appearance. Currently, the adjoining and non-adjoining regions are specified 
manually during training by a hard threshold. A probabilistic approach, where 
the regions are estimated by marginalising over the relative pose between adjoin- 
ing parts (to get a low dimensional model), would be better, but requires large 
amounts of training data. Note that this refinement matters only for, and is
thus used only for, better bottom-up identification of body parts. When the adjoining
part is specified using a multiple part configuration, the formulation presented 
later in Section 3.1 is used. 

2.4 Single Part Response Investigation 

The middle column of Fig. 4 shows the projection of the likelihood ratio com- 
puted using Equation (2) onto typical images containing significant clutter. The 
top image shows the response for a head while the other two images show the 
response of a vertically-oriented limb filter. It can be seen that the technique 
is highly discriminatory, producing relatively few false maxima. Note the false 
response in between the legs in the second image: the space between the legs is 
itself shaped like a leg. Although images were acquired using various cameras, 
some with noisy colour signals, system parameters were fixed for all test images. 

In order to provide a comparison with an alternative method, the responses 
obtained by comparing the hypothesised part boundaries with edge responses 
were computed in a similar manner to Sidenbladh and Black [12]. These are 
shown in the rightmost column of Fig. 4. Orientations of significant edge re- 
sponses for foreground and background configurations were learned (using deriva- 
tives of the probabilistic region template), treated as independent and normalised 
for scale. Contrast normalisation was not used. Other formulations (e.g. averag- 
ing) proved to be weaker on the scenes under consideration. The responses using 
this method are clearly less discriminatory. 

Fig. 5 illustrates the typical spatial variations of both the body part likeli- 
hood response proposed here and the edge-based likelihood. The edge response, 
whilst indicative of the correct position, has significant false positive likelihood 
ratios. The proposed part likelihood is more expensive to compute than the 
Fig. 4. First column: Typical input images from both outdoor and indoor environments.
Second column: projection of the log likelihood (positive only, re-scaled) from
the part filters. Third column: projection of the log likelihood ratio (positive only, re- 
scaled) for an edge-based model. First row: head model. Second and third rows: limb 
model (vertical orientation). 



edge-based filter (approximately an order of magnitude slower in our implemen- 
tation). However, it is far more discriminatory and as a result, fewer samples are 
needed when performing pose search, leading to an overall performance benefit. 
Furthermore, the collected foreground histograms are useful for other likelihood 
measurements as discussed below. 

3 Body Pose Estimation with Partial Configurations 

Since any single body part likelihood will result in false positives it is important 
to encode higher order relationships between body parts to improve discrimina- 
tion. In this system this is accomplished by encoding an expectation of structure 
in the foreground appearance and the spatial relationship of body parts. 

3.1 Extending Probabilistic Regions to Multi-part Configurations 

Configurations containing more than one body part can be represented us- 
ing a straightforward extension of the probabilistic region approach described 



Fig. 5. Comparison of the spatial variation (plotted for a horizontal change of 200 
pixels) of the learnt log likelihood ratios for the model presented here (left) and the 
edge-based model (right) of the head in the first image in Fig. 4. The correct posi- 
tion is centered and indicated by the vertical bar. Anything above the horizontal bar, 
corresponding to a likelihood ratio of 1, is more likely to be a head than not. 



above. In order to account for self-occlusion, the pose space is represented
by a depth ordered set, $V$, of probabilistic regions with parts sharing a common
scale parameter, $s$. When taken together, the templates determine the
probability that a particular image feature belongs to a particular part’s
foreground or background. More specifically, the probability that an image feature
at position $(x, y)$ belongs to the foreground appearance of part $i$ is given by
$M_i(T_i(x, y)) \times \prod_j (1 - M_j(T_j(x, y)))$, where $j$ labels closer, instantiated parts.
Forming the background appearance is more subtle since some parts often have
a similar appearance. Therefore, a list of paired body parts is specified manually
and the background appearance histogram is constructed from features weighted
by $\prod_k (1 - M_k(T_k(x, y)))$, where $k$ labels all instantiated parts other than $i$ and
ground and adjacent background appearance of several parts. When insufficient 
data is available to estimate either the foreground or the adjacent background 
histogram (as determined using an area threshold) the corresponding likelihood 
ratio is set to one. 
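The occlusion-aware weights for a single pixel can be sketched as below. The sketch assumes the template probabilities of all instantiated parts at that pixel are held in a list ordered from nearest to farthest, and that `paired` is a set of indices of parts paired with part `i`. Both conventions and the function names are illustrative.

```python
def foreground_weight(probs, i):
    """Weight of one pixel in part i's foreground histogram.

    probs -- template probabilities at this pixel, nearest part first
    """
    weight = probs[i]
    for closer in probs[:i]:
        weight *= 1.0 - closer   # discount by every closer, occluding part
    return weight


def background_weight(probs, i, paired=frozenset()):
    """Weight of one pixel in part i's adjacent background histogram."""
    weight = 1.0
    for k, p in enumerate(probs):
        if k != i and k not in paired:
            weight *= 1.0 - p    # exclude pixels claimed by other parts
    return weight
```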



3.2 Inter-part Joint Constraints 

A link is introduced between parts i and j if and only if they are physically 
connected neighbours. Each part has a set of control points that link it to its 
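neighbours. A link has an associated value $\mathrm{LINK}_{i,j}$ given by:

$$\mathrm{LINK}_{i,j} = \begin{cases} 1 & \text{if } \delta_{i,j}/s < \Delta_{i,j} \\ e^{(\delta_{i,j}/s - \Delta_{i,j})/\sigma} & \text{otherwise} \end{cases} \qquad (3)$$

where $\delta_{i,j}$ is the image distance between the control points of the pair, $\Delta_{i,j}$ is the
maximum un-penalised distance and $\sigma$ relates to the strength of penalisation.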
If the neighbouring parts do not link directly, because intervening parts are not 
instantiated, the un-penalised distance is found by summing the un-penalised 
Fig. 6. Left: A plot of the learnt PDFs of foreground appearance similarity for paired
and non-paired configurations. Right: The log of the resulting likelihood ratio. It can 
be seen, as would be expected, that more similar regions are more likely to be a pair. 



distances over the complete chain. This can be interpreted as a force between 
parts equivalent to a telescopic rod with a spring on each end. 
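The link term can be sketched as a piecewise function. Note the assumption about sign: with the exponent written as (δ/s − Δ)/σ, the value only falls below 1 for over-stretched links if σ is negative, so this sketch expects a negative `sigma`. The function name and argument names are hypothetical.

```python
import math


def link(distance, scale, max_free, sigma):
    """Link value for a pair of control points.

    distance -- image distance between the control points
    scale    -- shared scale parameter s
    max_free -- maximum un-penalised distance (summed over a chain if
                intervening parts are not instantiated)
    sigma    -- penalisation strength; assumed negative so stretching is penalised
    """
    gap = distance / scale - max_free
    if gap < 0.0:
        return 1.0                  # within the free range: no penalty
    return math.exp(gap / sigma)    # decays as the gap grows when sigma < 0
```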



3.3 Learnt Paired Part Similarity 



Certain pairs of body parts can be expected to have a similar foreground appear- 
ance to one another. For example, a person’s upper left arm will nearly always 
have a similar colour and texture to the upper right arm. In the current sys- 
tem, the limbs are paired with their opposing parts. To encode this knowledge, 
a PDF of the divergence measure (computed using Equation (1)) between the 
foreground appearance histograms of paired parts and non-paired parts is learnt. 
Equation (4) shows the resulting likelihood ratio and Fig. 6 graphs this ratio. 
Fig. 7 shows a typical image projection of this ratio and shows the technique 
to be highly discriminatory. It limits possible configurations if one limb can be 
found reliably and helps reduce the likelihood of incorrect large assemblies. 



$$\mathrm{PAIR}_{i,j} = \frac{p(I(F_i, F_j) \mid on_i, on_j)}{p(I(F_i, F_j) \mid \overline{on_i, on_j})} \qquad (4)$$



3.4 Combining the Likelihoods 

Learning the likelihood ratios allows a principled comparison of the various cues. 
The individual likelihood ratios are combined by assuming independence and 
the overall likelihood ratio is given by Equation (5). This rewards correct higher
dimensional configurations over correct lower dimensional ones. 

$$R = \prod_{i \in V} \mathrm{SINGLE}_i \times \prod_{i,j \in V} \mathrm{PAIR}_{i,j} \times \prod_{i,j \in V} \mathrm{LINK}_{i,j} \qquad (5)$$
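Under the independence assumption, the product in Equation (5) is conveniently evaluated as a sum of logarithms. The sketch below assumes that the individual ratios have already been computed and are strictly positive. The function name is hypothetical.

```python
import math


def log_overall_ratio(single, pair, link):
    """Log of the overall likelihood ratio: the sum of the individual log ratios."""
    return sum(math.log(v) for v in list(single) + list(pair) + list(link))
```

A ratio of exactly 1 (for example, a part with too little data to form a histogram) adds nothing to the sum. Each additional correctly placed part with a ratio above 1 raises the score, which is why larger correct configurations are favoured over smaller ones.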
Fig. 7. Investigation of a paired part response. Left: an image for which significant
limb candidates are found in the background. Right: the projection of the likelihood 
ratio for the paired response to the person’s lower right leg in the image. 



3.5 Pose Estimation Results 

The sampling scheme is described only briefly here as the emphasis of this paper 
is on a new formulation and likelihood. The search techniques will be more 
fully developed in future work. It is emphasised that the aim of the sampler 
is treated as one of maximisation rather than density estimation. The system 
begins by making a coarse regular scan of the image for the head and limbs. 
These results are then locally optimised. Part configurations are sampled from 
the resulting distribution and combined to form larger configurations and then 
optimised (in the full dimensional pose space, including the body part label) 
for a fixed period of time. It is envisaged that, due to the flexibility of the 
parametrisation, a set of optimization methods such as genetic style combination, 
prediction, local search, re-ordering and re-labelling can be combined using a 
scheduling algorithm and a shared sample population to achieve rapid, robust, 
global, high dimensional pose estimation. The system was implemented using an 
efficient, in-house C++ framework. Histograms with 8x8x8 bins were used to
represent a part’s foreground and adjacent background appearance. The system
samples single part configurations at the scale shown in Fig. 2 at approximately
3 kHz from an image with resolution 640 x 480 on a 2 GHz PC. Fig. 8 shows
results of searching for partial pose configurations. It should be emphasised that 
although inter-part links are not visualised here, these results represent estimates 
of pose configurations with inter-part connectivity as opposed to independently 
detected parts. The scale of the model was fixed and the elongation parameter 
was constrained to be above 0.7. 

4 Summary 

A system was presented that allows detailed, efficient estimation of human pose 
from real-world images. The focus of the paper was the investigation of a novel 
likelihood model. The two key contributions were (i) a formulation that allowed 
Fig. 8. Results from a search for partial pose configurations. The images are of both
indoor and outdoor scenes and contain a significant amount of background clutter and 
in one case a door which partially occludes the subject. The samples with maximum 
score after searching for 2 minutes are shown. 



the representation and comparison of partial (lower dimensional) solutions and 
modelled other object occlusion and (ii) a highly discriminatory learnt likelihood 
based upon probabilistic regions that allowed efficient body part detection. It 
should be stressed that this likelihood depends only on there being differences 
between a hypothesised part’s foreground appearance and adjacent background 
appearance. It does not make use of scene-specific background models and is, 
as such, general and applicable to unconstrained scenes. The results presented 
confirm that it is possible to use partial configurations and a strong likelihood 
model to localise the body in real-world images. To improve the results, future 
work will need to address the following issues. A limited model of appearance 
was employed based on colour values. Texture orientation features should be 
employed to disambiguate overlapping parts (e.g. the arm lying over the torso). 
The model should be extended through closer consideration of the distinction 
between structural (kinematic) and visual segmentation of the body. The
assumptions of independence between the individual likelihoods, particularly for
the link and paired appearance likelihoods, need investigation. Lastly, and
perhaps most importantly, future work needs to improve the sampler to allow high
dimensional configurations that contain self occlusion and visually similar
neighbouring parts to be localised.

Abstract. Tensor fields (matrix valued data sets) have recently attracted
increased attention in the fields of image processing, computer
vision, visualization and medical imaging. Tensor field segmentation is
an important problem in tensor field analysis and has not been addressed
adequately in the past. In this paper, we present an effective region-based
active contour model for tensor field segmentation and show its application
to diffusion tensor magnetic resonance images (MRI) as well as to
the texture segmentation problem in computer vision. Specifically, we
present a variational principle for an active contour using the Euclidean
difference of tensors as a discriminant. The variational formulation is
valid for piecewise smooth regions; however, for the sake of simplicity of
exposition, we present the piecewise constant region model in detail. This
variational principle is a generalization of the region-based active contour
to matrix valued functions. It naturally leads to a curve evolution
equation for tensor field segmentation, which is subsequently expressed
in a level set framework and solved numerically. Synthetic and real data
experiments involving the segmentation of diffusion tensor MRI as well
as structure tensors obtained from real texture data demonstrate
the performance of the proposed model.



1 Introduction 

Tensor fields are essential components in many applications such as DT-MRI
processing, texture image segmentation, and solid and fluid mechanics. Several
interesting problems constitute tensor field analysis in the context of imaging
applications, for example tensor field data acquisition, restoration,
segmentation and visualization. Though much effort has been expended on tensor
field data acquisition, restoration and visualization, tensor field segmentation
has not been adequately addressed in the past. In this paper, we address the
general problem of tensor field segmentation and then depict examples of
application of the algorithm to medical image analysis, specifically to diffusion
tensor MRI segmentation, and additionally to texture image segmentation. In the
following, we present a brief overview of various techniques currently in vogue
for using tensor-based information to segment motion fields, textures and DT-MRI.

* This research was in part funded by the NIH grant RO1-NS42075 

There are many algorithms in the literature for motion segmentation; however,
not many of them use the structure tensor. Kühne et al. [11] proposed an
interesting tensor-driven active contour model for moving object segmentation: a
3D structure tensor is computed in the spatio-temporal domain of a video sequence
and is used to create the stopping function in a geometric active contour model.
The results they show are quite promising for motion segmentation. Some other
examples of published research on the use of the structure tensor in the context
of optical flow computation are [4,9].

Recently, Rousson et al. [12] developed a technique for segmenting textures
wherein they first construct texture features based on the image and its structure
tensor, then use an active and adaptive contour model to segment this feature
vector field. The texture segmentation results in their work are very impressive.

In the context of DT-MRI segmentation, Zhukov et al. [18] recently proposed
a level set segmentation method which is in fact a segmentation of a scalar
anisotropic measure of the diffusion tensor. Because Zhukov et al. [18] use
a scalar field computed from the diffusion tensor field, they ignore
the direction information contained in the tensor field. Thus, this method will
fail if two homogeneous regions of the tensor field have the same anisotropy
property but are oriented in totally different directions. Moreover, any of the
numerous (well tested and well understood) scalar image segmentation techniques
could have been employed for achieving the goal of segmenting the scalar field of
anisotropy measures. In contrast, we present an algorithm to segment tensor
fields using all the information contained in a tensor: not only its scalar
anisotropy properties, but also its orientation.

To the best of our knowledge, there is no published work in the literature which
aims to segment tensor fields. In this paper, we tackle the tensor field
segmentation problem using an effective region-based active contour model.
Geometric active contour models have long been used in scalar and vector image
segmentation [5,6,13,14,10,17]. Our work can be viewed as an extension of the
work on region-based active contours ([7], [8], [17]) to matrix valued images.
These region-based active contours are curve evolution implementations of the
Mumford-Shah functional [15]. Our key contribution is the incorporation of a
discriminant of tensors into the region-based active contour model and the
demonstration of its effectiveness in tensor field segmentation. The specific
discriminant we use is the Frobenius norm of the difference of two tensors.
Although this norm has been used in the past for tensor field restoration, to the
best of our knowledge, it has never been used for tensor field segmentation.

The rest of the paper is organized as follows: in section 2, the piecewise smooth
and piecewise constant region-based active contour models for tensor field
segmentation are described. The Euler-Lagrange equation and the curve evolution
equation are given for the piecewise constant model for simplicity of exposition.
Section 3 contains a detailed description of the level set formulation and the
implementation using an explicit scheme. In section 4, we present experiments
on application of our model to synthetic as well as real data. Finally, in section
5, we discuss the pros and cons of our approach and some future directions.

2 Model Description

Our model for tensor field segmentation in $\mathbb{R}^2$ is posed as minimization of the
following variational principle based on the Mumford-Shah functional ([15]):

$$E(T, C) = \int_{\Omega} \mathrm{dist}^2(T(x), T_0(x))\,dx + \alpha \int_{\Omega/C} \|\nabla T(x)\|^2\,dx + \beta |C| \qquad (1)$$

where the curve $C$ is the boundary of the desired unknown segmentation, $\Omega$ is
the image domain, $T_0$ is the original noisy tensor field, $T$ is a piecewise smooth
approximation of $T_0$ with discontinuities only along $C$, $\nabla T$ is a component-wise
gradient of each element of the tensor, $|C|$ is the arclength of the curve $C$, $\alpha$
and $\beta$ are control parameters, and $\mathrm{dist}(\cdot, \cdot)$ is a measure of the distance
between two tensors.

The above variational principle will capture piecewise smooth regions while
maintaining a smooth boundary. A simplified form which aims to capture two
types of piecewise constant regions is given by:

$$E(C, T_1, T_2) = \int_{R} \mathrm{dist}^2(T(x), T_1)\,dx + \int_{R^c} \mathrm{dist}^2(T(x), T_2)\,dx + \beta |C| \qquad (2)$$

where $R$ is the region enclosed by $C$ and $R^c$ is the region outside $C$.

The above model in equation (2) can be viewed as a modification of the active
contour model without edges for scalar valued images in [7]. The difference
measures in [7] for the scalar values are simple, be it intensity or curvature. In
the proposed model, a key issue will be the right choice of a tensor difference
measure. Any other segmentation models that are generalizations of the one
proposed here for tensor fields will unavoidably encounter this fundamental
problem. Alexander et al. [1] discussed different similarity measures for matching
of diffusion tensor images and indicated that the Euclidean difference measure of
tensors is the best in the context of image registration. The Euclidean difference
metric is defined as follows:

$$\mathrm{dist}(A, B) = \sqrt{\mathrm{trace}[(A - B)^2]} = \|A - B\|_F \qquad (3)$$

where $A$ and $B$ are two rank 2 tensors of the same size, or simply two matrices
of the same size, and $\|\cdot\|_F$ is the matrix Frobenius norm.
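
As a minimal sketch of the distance in equation (3), the function below computes the Frobenius norm of the difference of two matrices stored as nested Python lists. The name `tensor_dist` and the sample tensors are illustrative assumptions, not part of the original work.

```python
import math


def tensor_dist(a, b):
    """Euclidean (Frobenius) distance between two equally sized matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("tensors must have the same size")
    # Square root of the sum of squared element-wise differences.
    return math.sqrt(sum((x - y) ** 2
                         for ra, rb in zip(a, b)
                         for x, y in zip(ra, rb)))


# Two hypothetical 2x2 symmetric tensors with swapped eigenvalues.
A = [[2.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 2.0]]
d = tensor_dist(A, B)  # equals sqrt(1 + 1)
```

The element-wise form treats every tensor component with the same weight, which is the property the conclusion later identifies as a limitation of this discriminant.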

In the context of tensor field segmentation, we also found that the Euclidean
difference measure of tensors is a good choice. We define the mean value of
tensors in a region $R$ as:

$$\mu(T; R) = \arg\min_{\mu} \int_{R} [\mathrm{dist}(\mu, T(x))]^2\,dx \qquad (4)$$

When we choose $\mathrm{dist}(\cdot, \cdot)$ to be the Euclidean difference measure, it is not hard
to verify that $\mu(T; R) = \int_R T(x)\,dx / |R|$.
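
The claim that the minimiser in equation (4) is the element-wise average can be checked numerically on a discrete region. The sketch below replaces the integral by a sum over a handful of hypothetical sample tensors; `mean_tensor` and `squared_cost` are illustrative names introduced here.

```python
def mean_tensor(tensors):
    """Element-wise average of a non-empty list of equally sized matrices."""
    n = len(tensors)
    rows, cols = len(tensors[0]), len(tensors[0][0])
    return [[sum(t[r][c] for t in tensors) / n for c in range(cols)]
            for r in range(rows)]


def squared_cost(candidate, tensors):
    """Sum of squared Frobenius distances from `candidate` to every sample."""
    return sum((candidate[r][c] - t[r][c]) ** 2
               for t in tensors
               for r in range(len(candidate))
               for c in range(len(candidate[0])))


# Hypothetical samples standing in for the tensors T(x) inside a region R.
samples = [
    [[2.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.5], [0.5, 1.0]],
    [[3.0, -0.5], [-0.5, 2.0]],
]
mu = mean_tensor(samples)

# Nudging any single entry away from the mean increases the cost.
nudged = [row[:] for row in mu]
nudged[0][1] += 0.1
```

Because the squared Frobenius distance is a sum of independent squared differences per component, each component of the minimiser is the ordinary scalar mean of that component.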

We follow the two-phase implementation of [7]. First, with $C$ fixed, $T_1 =
\mu(T; R)$ and $T_2 = \mu(T; R^c)$. Then, with $T_1$ and $T_2$ fixed, the Euler-Lagrange
equation for the variational principle (2) is:

$$[\beta k - \mathrm{dist}^2(T, T_1) + \mathrm{dist}^2(T, T_2)]\,N = 0$$

where $N$ is the outer normal of the curve $C$ and $k$ is its curvature. We then have
the corresponding gradient flow or the curve evolution form for the above
equation as:

$$\frac{\partial C}{\partial t} = -[\beta k - \mathrm{dist}^2(T, T_1(t)) + \mathrm{dist}^2(T, T_2(t))]\,N,$$

$$T_1 = \frac{\int_{R} T(x)\,dx}{|R|}, \qquad T_2 = \frac{\int_{R^c} T(x)\,dx}{|R^c|} \qquad (5)$$

This can be easily solved numerically as described subsequently. In a similar
fashion, one can write down the curve evolution equation for equation (1).
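
The first phase of the two-phase scheme, computing the region means $T_1$ and $T_2$ for a fixed curve, can be sketched as follows. The representation is an assumption made for illustration: the field is a grid of 2x2 tensors and the curve is given implicitly by a boolean mask that is true inside $R$. The name `region_means` is hypothetical.

```python
def region_means(field, inside):
    """Mean tensor inside and outside a region.

    field[i][j] is a 2x2 tensor; inside[i][j] is True for pixels in R.
    Returns (T1, T2), the means over R and over its complement.
    """
    def mean(select):
        acc = [[0.0, 0.0], [0.0, 0.0]]
        count = 0
        for tensor_row, mask_row in zip(field, inside):
            for tensor, flag in zip(tensor_row, mask_row):
                if bool(flag) is select:
                    count += 1
                    for r in range(2):
                        for c in range(2):
                            acc[r][c] += tensor[r][c]
        if count == 0:
            raise ValueError("region is empty")
        return [[value / count for value in row] for row in acc]

    return mean(True), mean(False)


# Hypothetical 2x4 field: left half holds tensor A, right half holds tensor B.
A = [[2.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 2.0]]
field = [[A, A, B, B], [A, A, B, B]]
inside = [[True, True, False, False], [True, True, False, False]]
t1, t2 = region_means(field, inside)
```

In the second phase these two means are held fixed while the curve moves, after which the means are recomputed for the new regions.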



3 Level Set Implementation and Numerical Methods 



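The curve evolution equation (5) can be easily implemented in a level set
framework. The corresponding level set formulation is given by:

$$\frac{\partial \phi}{\partial t} = \left[\beta\,\mathrm{div}\!\left(\frac{\nabla \phi}{|\nabla \phi|}\right) - \mathrm{dist}^2(T, T_1) + \mathrm{dist}^2(T, T_2)\right] |\nabla \phi|,$$

$$T_1 = \frac{\int_{\Omega} (1 - H(\phi))\,T(x)\,dx}{\int_{\Omega} (1 - H(\phi))\,dx}, \qquad T_2 = \frac{\int_{\Omega} H(\phi)\,T(x)\,dx}{\int_{\Omega} H(\phi)\,dx} \qquad (6)$$

where $H(\cdot)$ is the Heaviside function, $H(\phi(x)) = 0$ for $x \in R$ and $H(\phi(x)) = 1$
otherwise. Equation (6) can be easily discretized using an explicit Euler scheme.
We can assume the spatial grid size to be 1, then the finite differences of the
partial derivatives are:

$$\Delta^{x} \phi_{i,j} = \tfrac{1}{2}(\phi_{i+1,j} - \phi_{i-1,j}), \qquad \Delta^{y} \phi_{i,j} = \tfrac{1}{2}(\phi_{i,j+1} - \phi_{i,j-1})$$

$$\Delta^{xx} \phi_{i,j} = \phi_{i+1,j} - 2\phi_{i,j} + \phi_{i-1,j}$$

$$\Delta^{xy} \phi_{i,j} = \tfrac{1}{4}(\phi_{i+1,j+1} - \phi_{i+1,j-1} - \phi_{i-1,j+1} + \phi_{i-1,j-1})$$

$$\Delta^{yy} \phi_{i,j} = \phi_{i,j+1} - 2\phi_{i,j} + \phi_{i,j-1}$$

In this case, we have the following update equation:

$$\frac{\phi^{n+1}_{i,j} - \phi^{n}_{i,j}}{\Delta t} = \left[\beta k^{n}_{i,j} - \mathrm{dist}^2(T_{i,j}, T_1^{n}) + \mathrm{dist}^2(T_{i,j}, T_2^{n})\right] \sqrt{(\Delta^{x} \phi^{n}_{i,j})^2 + (\Delta^{y} \phi^{n}_{i,j})^2},$$

$$T_1^{n} = \frac{\sum_{i,j} (1 - H(\phi^{n}_{i,j}))\,T_{i,j}}{\sum_{i,j} (1 - H(\phi^{n}_{i,j}))}, \qquad T_2^{n} = \frac{\sum_{i,j} H(\phi^{n}_{i,j})\,T_{i,j}}{\sum_{i,j} H(\phi^{n}_{i,j})} \qquad (7)$$

where the curvature $k^{n}_{i,j}$ of $\phi^{n}$ can be computed as:

$$k^{n}_{i,j} = \frac{\Delta^{xx} \phi^{n}_{i,j} (\Delta^{y} \phi^{n}_{i,j})^2 - 2 \Delta^{x} \phi^{n}_{i,j} \Delta^{y} \phi^{n}_{i,j} \Delta^{xy} \phi^{n}_{i,j} + \Delta^{yy} \phi^{n}_{i,j} (\Delta^{x} \phi^{n}_{i,j})^2}{\left[(\Delta^{x} \phi^{n}_{i,j})^2 + (\Delta^{y} \phi^{n}_{i,j})^2\right]^{3/2}} \qquad (8)$$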
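
The curvature in equation (8) can be sketched with central differences on a unit grid. The function below is an illustration under stated assumptions: `phi[i][j]` uses `i` as the x index and `j` as the y index, and flat spots with zero gradient are assigned zero curvature, a choice made here that the text does not specify.

```python
import math


def curvature(phi, i, j):
    """Curvature of the level sets of phi at an interior grid point (i, j)."""
    dx = (phi[i + 1][j] - phi[i - 1][j]) / 2.0
    dy = (phi[i][j + 1] - phi[i][j - 1]) / 2.0
    dxx = phi[i + 1][j] - 2.0 * phi[i][j] + phi[i - 1][j]
    dyy = phi[i][j + 1] - 2.0 * phi[i][j] + phi[i][j - 1]
    dxy = (phi[i + 1][j + 1] - phi[i + 1][j - 1]
           - phi[i - 1][j + 1] + phi[i - 1][j - 1]) / 4.0
    grad_sq = dx * dx + dy * dy
    if grad_sq == 0.0:
        return 0.0  # assumption: no curvature is defined where phi is flat
    return (dxx * dy * dy - 2.0 * dx * dy * dxy + dyy * dx * dx) / grad_sq ** 1.5


# Signed distance to a circle of radius 5: the level set through a point at
# distance 10 from the centre is a circle of radius 10, so k should be near 0.1.
size, centre = 41, 20
phi = [[math.hypot(i - centre, j - centre) - 5.0 for j in range(size)]
       for i in range(size)]
k = curvature(phi, centre + 10, centre)
```

For a signed distance function the level set through a point at distance r from the centre has curvature 1/r, which gives a simple check of the discretization.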

Fig. 1. Segmentation of a synthetic tensor field where the two regions differ only in
their orientations. (b), (c) and (d) are the initial, intermediate and final steps of the
curve evolution process in the segmentation.



There are many other efficient numerical schemes that one may employ, for
example the multigrid scheme used by Tsai et al. [17]. At this time, our
explicit Euler scheme yielded reasonably fast solutions (3-5 seconds for the
synthetic data examples and just under a minute for the real data examples on a
1 GHz Pentium-3 CPU). For the piecewise smooth model, we refer the readers to
([8], [17]) for implementation details.

4 Experimental Results 

In this section, we present several sets of experiments on the application of our
tensor field segmentation model. The first is on 2D synthetic data sets, the second
is on texture images, and the third and the last are on slices of diffusion tensor
fields estimated from diffusion weighted images. We apply the piecewise constant
case of our model for the first three sets of examples and the original piecewise
smooth model for the last example. In all examples, the evolving boundary of
the segmentation is superimposed on the images in either black or white.

4.1 Synthetic Tensor Field Segmentation 

We synthesize two tensor fields; both are 2x2 symmetric positive definite matrix
valued images on a 128 x 128 lattice and have two homogeneous regions. The
two regions in the first tensor field differ only in their orientations, while the
two regions in the second tensor field differ only in their scales. These two tensor
fields are visualized by ellipses as shown in Figure 1(a) and Figure 2(a). With
an arbitrary initialization of the geometric active contour, our proposed model
can yield high quality segmentation results as shown in Figure 1 and Figure 2.
Note that the first tensor field cannot be segmented by using scalar anisotropic
properties of tensors as in [18], and the second tensor field cannot be segmented
by using the dominant eigenvectors of the tensors. These two examples show that
one must use the full information contained in tensors.
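
The two failure cases described above can be illustrated with a small numeric sketch. The anisotropy index used here, (l1 - l2) / (l1 + l2), is a simple 2D stand-in chosen for illustration and is not the measure used in [18]; all function names and parameter values are hypothetical.

```python
import math


def make_tensor(l1, l2, theta):
    """2x2 SPD tensor with eigenvalues l1 >= l2 and principal axis at angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    off = (l1 - l2) * c * s
    return [[l1 * c * c + l2 * s * s, off], [off, l1 * s * s + l2 * c * c]]


def eigenvalues(t):
    mean = (t[0][0] + t[1][1]) / 2.0
    dev = math.sqrt(((t[0][0] - t[1][1]) / 2.0) ** 2 + t[0][1] ** 2)
    return mean + dev, mean - dev


def anisotropy(t):
    big, small = eigenvalues(t)
    return (big - small) / (big + small)


def dominant_angle(t):
    """Orientation of the dominant eigenvector, in [0, pi)."""
    return (0.5 * math.atan2(2.0 * t[0][1], t[0][0] - t[1][1])) % math.pi


def frobenius(a, b):
    return math.sqrt(sum((a[r][c] - b[r][c]) ** 2
                         for r in range(2) for c in range(2)))


# Case 1: same eigenvalues, different orientation.
rot_a = make_tensor(4.0, 1.0, 0.0)
rot_b = make_tensor(4.0, 1.0, math.pi / 2.0)
# Case 2: same orientation, different scale.
scale_a = make_tensor(4.0, 1.0, 0.3)
scale_b = make_tensor(8.0, 2.0, 0.3)
```

In the first case the anisotropy values coincide while the Frobenius distance is clearly non-zero; in the second case the dominant orientations coincide while the Frobenius distance is again non-zero.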

Fig. 2. Segmentation of a synthetic tensor field where the two regions differ only in
their scales. (b), (c) and (d) are the initial, intermediate and final steps of the curve
evolution process in the segmentation.




Fig. 3. Texture segmentation for a heart-shaped region: (b) and (c) are the initial and
final curve superimposed on the texture image; (d), (e) and (f) are the intermediate
and final steps with the evolving curve superimposed on the structure tensor field.

Fig. 4. Texture segmentation for a region showing the "ECCV" logo: (b) and (c) are the
initial and final curve superimposed on the texture image; (d), (e) and (f) are the
intermediate and final steps with the evolving curve superimposed on the structure
tensor field.



Fig. 5. A slice of the diffusion tensor field of a normal rat spinal cord. Each component
is shown as a scalar image. Left to right: $D_{xx}$, $D_{yy}$, $D_{zz}$, $D_{xy}$, $D_{yz}$ and $D_{xz}$
respectively. The off-diagonal terms $D_{xy}$, $D_{yz}$ and $D_{xz}$ are greatly enhanced by
brightness and contrast changes for better visualization.

Fig. 6. A slice of the diffusion tensor field of a normal rat brain. Each component is
shown as a scalar image. Left to right: $D_{xx}$, $D_{yy}$, $D_{zz}$, $D_{xy}$, $D_{yz}$ and $D_{xz}$
respectively. The off-diagonal terms $D_{xy}$, $D_{yz}$ and $D_{xz}$ are greatly enhanced by
brightness and contrast changes for better visualization.




Fig. 7. Segmentation of the slice of the diffusion tensor image shown in Fig. 5.
(a)-(d) are the initial, intermediate and final steps of the curve evolution process in
segmenting the gray matter inside the spinal cord.

Fig. 8. Segmentation of the slice of the diffusion tensor image shown in Fig. 6.
(a)-(d) are the initial, intermediate and final steps of the curve evolution process in
segmenting the corpus callosum inside the rat brain.

Fig. 9. Use of the piecewise smooth model to segment the slice of the diffusion tensor
image shown in Fig. 6. (a) and (b) show the boundary of the final segmentation
superimposed on the smoothed and the original tensor field respectively; (c) is the
smoothed tensor field.



4.2 Texture Image Segmentation 

For texture image segmentation, we construct the structure tensor field from
the given image and then segment the structure tensor field using our proposed
model. The structure tensor is defined as [3]:

$$J_{\rho} = K_{\rho} * (\nabla I\,\nabla I^{T})$$

where $K_{\rho}$ is a Gaussian smoothing function with standard deviation $\rho$ and $I$ is
the image. Figures 3 and 4 show that our method can yield reasonable quality
texture segmentations. Note that the segmentations are not very accurate along
the edges of the original texture images. This is because we did not use any
anisotropic smoothing for the structure tensor field as in [16], and thus the edges
in the original image were not preserved. It is, however, easy to incorporate such
anisotropic smoothing to yield better quality segmentations, and this will be the
focus of our future work. Figure 4 also shows that topological change of the
regions can be achieved easily in a level set framework.
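
A minimal sketch of the structure tensor computation is given below, using only the standard library. Several details are assumptions made for illustration rather than choices stated in the text: gradients are central differences, borders are handled by replicating edge pixels, and the Gaussian kernel is truncated at three standard deviations.

```python
import math


def _clamp(index, size):
    return min(max(index, 0), size - 1)


def gaussian_kernel(rho):
    """Normalised 1D Gaussian kernel truncated at 3 * rho (an assumption)."""
    radius = max(1, int(math.ceil(3.0 * rho)))
    weights = [math.exp(-(k * k) / (2.0 * rho * rho))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(img, kernel):
    """Separable smoothing with replicated borders: columns first, then rows."""
    h, w = len(img), len(img[0])
    r = len(kernel) // 2
    rows = [[sum(kernel[k + r] * img[i][_clamp(j + k, w)]
                 for k in range(-r, r + 1))
             for j in range(w)] for i in range(h)]
    return [[sum(kernel[k + r] * rows[_clamp(i + k, h)][j]
                 for k in range(-r, r + 1))
             for j in range(w)] for i in range(h)]


def structure_tensor(img, rho):
    """Return the three distinct components (Jxx, Jxy, Jyy) of the structure tensor."""
    h, w = len(img), len(img[0])
    # i indexes rows (y direction), j indexes columns (x direction).
    gx = [[(img[i][_clamp(j + 1, w)] - img[i][_clamp(j - 1, w)]) / 2.0
           for j in range(w)] for i in range(h)]
    gy = [[(img[_clamp(i + 1, h)][j] - img[_clamp(i - 1, h)][j]) / 2.0
           for j in range(w)] for i in range(h)]
    kernel = gaussian_kernel(rho)

    def product(a, b):
        return [[a[i][j] * b[i][j] for j in range(w)] for i in range(h)]

    return (smooth(product(gx, gx), kernel),
            smooth(product(gx, gy), kernel),
            smooth(product(gy, gy), kernel))


# Hypothetical texture that varies along x only.
stripes = [[math.sin(0.5 * j) for j in range(16)] for _ in range(16)]
jxx, jxy, jyy = structure_tensor(stripes, rho=1.5)
```

For this striped texture only the xx component is non-zero, so the resulting tensor field encodes the stripe orientation at every pixel.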

4.3 Diffusion Tensor Image Segmentation 

Diffusion tensor MRI (DT-MRI) is a relatively new MR imaging modality from
which anisotropy of water diffusion can be inferred quantitatively [2], thus
providing a method to study tissue microstructure, e.g., white matter
connectivity in the brain, in vivo. Diffusion is a process of movement of molecules
as a result of random thermal agitation and, in the DT-MRI context, refers
specifically to the random translational motion of water molecules in the part of
the anatomy being imaged with MR. In three dimensions, water diffusivity can be
described by a 3 x 3 symmetric positive definite matrix D, called the diffusion
tensor, which is intimately related to the geometry and organization of the
microscopic environment.

In DT-MRI, what is measured is the diffusion weighted echo intensity image
(DWI) $S_l$. These images are related to the diffusion tensor D through the
Stejskal-Tanner equation [2] as given by:

$$S_l = S_0 e^{-b_l : D} = S_0 e^{-\sum_{i=1}^{3} \sum_{j=1}^{3} b_{l,ij} D_{ij}} \qquad (9)$$

where $b_l$ is the diffusion weighting of the $l$-th magnetic gradient and ":" denotes
the generalized inner product for matrices. Given several non-collinear diffusion
weighted intensity measurements, D can be estimated via multivariate regression
models.
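
The forward model in equation (9) can be sketched as follows. The construction of the weighting matrix as a scalar b-value times the outer product of a unit gradient direction is a common convention assumed here, and the numeric values are purely illustrative.

```python
import math


def inner(a, b):
    """Generalized inner product of two 3x3 matrices: sum of element-wise products."""
    return sum(a[i][j] * b[i][j] for i in range(3) for j in range(3))


def b_matrix(b_value, g):
    """Weighting matrix b_value * g g^T for a unit direction g (assumed convention)."""
    return [[b_value * g[i] * g[j] for j in range(3)] for i in range(3)]


def dwi_signal(s0, b, d):
    """Diffusion weighted intensity predicted by the Stejskal-Tanner equation."""
    return s0 * math.exp(-inner(b, d))


# Hypothetical tensor with stronger diffusion along x than along y or z.
D = [[0.003, 0.0, 0.0], [0.0, 0.001, 0.0], [0.0, 0.0, 0.001]]
s_x = dwi_signal(100.0, b_matrix(1000.0, (1.0, 0.0, 0.0)), D)
s_y = dwi_signal(100.0, b_matrix(1000.0, (0.0, 1.0, 0.0)), D)
```

Taking logarithms gives ln(S_0 / S_l) = b_l : D, which is linear in the six independent components of D; this is what makes estimation by linear regression possible.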

Figure 5 shows a slice of the diffusion tensor field estimated from the DWIs 
of a normal rat spinal cord and Figure 6 shows the same for a normal rat brain. 
Each of the six independent components of the individual symmetric positive 
definite diffusion tensors in the tensor field is shown as a scalar image. Figure 7 
demonstrates the segmentation of the gray matter inside the normal rat spinal 
cord with the evolving curve superimposed on the ellipsoid visualization of the 
diffusion tensor field. Similarly, Figure 8 depicts the segmentation procedure for 
the normal rat brain. In the final step, the major part of the corpus callosum 
is captured by the piecewise constant segmentation model. In both cases, we 
exclude the free water region which is not of interest in a biological context. 

4.4 DTI Segmentation Using the Piecewise Smooth Model 

Most of the previous examples have been successfully segmented using the
piecewise constant model. The one exception is the example shown in Figure 8, for
which the piecewise constant assumption is no longer valid. We therefore employ
the piecewise smooth model to refine the segmentation result in Figure 8 and
show the result in Figure 9. Note that the horns of the corpus callosum have been
accurately captured using the piecewise smooth model, unlike with the piecewise
constant region model used in Figure 8.

5 Conclusion 

We presented a tensor field segmentation method by incorporating a discriminant
for tensors into a region-based active contour model. The particular discriminant
we employed is the Euclidean difference measure between tensors. By using a
discriminant on tensors, as opposed to either the eigenvalues or the eigenvectors
of the tensors, we make full use of all the information contained in tensors. This
proposed model is then implemented in a level set framework to take advantage
of the ease with which this framework handles topological changes. Our
approach was applied to 2D synthetic and diffusion tensor field segmentation
as well as texture image segmentation by using its structure tensor field. In the
experiments, the essential parts of the regions are well captured and topological
changes are handled naturally.

The model given here can be further improved in many ways. Our future
work will include the following: (i) a better discriminant of tensors needs to be
used. Though the Euclidean difference measure uses the full tensor information, it
blindly uses the same weights for different components of the tensor and ignores
the fact that tensors have structure; (ii) shape statistics can be incorporated to
improve the robustness and accuracy of the current model.

Acknowledgment. We thank Dr. T. Mareci and E. Ozarslan for providing the 
DT-MRI data and Dr. R. Deriche for his valuable comments on this research. 

Abstract. We describe a framework for registering a group of images
together using a set of non-linear diffeomorphic warps. The result of the 
groupwise registration is an implicit definition of dense correspondences 
between all of the images in a set, which can be used to construct statis- 
tical models of shape change across the set, avoiding the need for manual 
annotation of training images. We give examples on two datasets (brains 
and faces) and show the resulting models of shape and appearance vari- 
ation. We show results of experiments demonstrating that the groupwise 
approach gives a more reliable correspondence than pairwise matching 
alone. 



1 Introduction 

We address the problem of determining dense correspondences across a set of 
images of similar but varying objects, a key problem in computer vision. Given 
such a set of images and their correspondences, an annotation of one image can 
be propagated to all of the others, and statistical shape models of the appear- 
ance and variations of the set of images can be built. Furthermore, a method 
of automatically determining the correspondences leads to a system capable of 
learning statistical models of appearance in an entirely unsupervised fashion. 

Registration of pairs of images has been extensively studied for medical im- 
ages, with many different non-rigid registration algorithms being proposed to 
deform one image until it matches a second, see for example [11]. These methods 
can be extended to finding correspondences across a set of images by registering 
each image in the set to a chosen reference image using pairwise methods [18]. 
However, only by examining a whole set of images of a class of objects can
one learn which are the important features. This cannot be determined from
pairwise approaches alone. Following recent work on landmark correspondence
for sets of shapes [6], we propose that the groupwise correspondence problem
should explicitly optimise functions that measure the quality of the
correspondence across the whole set of images simultaneously.

We believe that for non-rigid registration the warping functions should be
continuous, smooth and invertible, so that every point in image A maps to
exactly one point in image B, and vice-versa. Such smooth, invertible functions
are known as diffeomorphisms. The mathematics of diffeomorphism groups is
complex and mysterious - they form infinite dimensional groups and smooth
manifolds, yet are not Lie groups. However, usefully complex sets of
diffeomorphic functions can be constructed by the composition of simple basis
functions (see section 3.2). In cases where structures appear or disappear between
one image and the next, these should be explicitly modelled as creation or
destruction processes - such processes will not be addressed in this paper.

This paper proposes a general framework for computing diffeomorphisms 
that define dense correspondences across a set of images so as to minimise a 
groupwise objective function based on ideas of minimum message length. We will 
first introduce a novel pairwise algorithm capable of achieving a diffeomorphic 
mapping between a pair of images, and then generalise this to the groupwise 
case. We present results of applying the algorithm to sets of brain images and 
face images, show the statistical models of shape and appearance constructed 
from the correspondences and demonstrate that the groupwise method gives 
more reliable results than an equivalent pairwise approach. 

2 Background 

Finding mappings between structures across a set of images can facilitate many 
image analysis tasks. One particular area of importance is in medical image in- 
terpretation, where image registration can help in tasks as diverse as anatomical 
atlas matching and labelling, image classification, and data fusion. Statistical 
models may be constructed based on these mappings, and have been found 
to be widely applicable to image analysis problems [5,4]. However, variability 
in anatomy and in capture conditions - both inter-patient and intra-patient - 
means that identifying correspondences is far from straightforward. 

The same correspondence problem is found in computer vision tasks (for
example, stereo vision) and in remote sensing. This wide range of
applications means that many researchers have investigated image registration
methods and the use of deformable models; for overviews see for example
[22,14,11,21].

Many algorithms have been proposed which are driven by the
optimisation of some measure of intensity similarity between images, such as
sum-of-squared-intensity differences, or mutual information [10,19].

Methods of propagating the deformations across the image include elastic
deformations [1], viscous fluid models [3], and splines [18,12]. The method that
is most similar to that proposed here is that of Lötjönen and Mäkelä [9], who
describe an elastic matching approach in which spherical regions of one image are
deformed so as to better match the other. However, the deformations do not have
continuous derivatives at the border, and are therefore not diffeomorphic - in
the work described in this paper we use a similar representation of deformation,
but use fully diffeomorphic deformations within ellipsoidal regions. There are
also similarities with the work of Feldmar and Ayache [7], who used local affine
transformations to deform one surface onto another.

Davies et al. [6] directly addressed the problem of generating optimal
correspondences for building shape models from landmarked data, noting
improvements in model quality when a ‘groupwise’ cost function was used that was
based on Minimum Description Length. Spherical harmonic parameters can also
be used to directly optimise the shape parameterisation [15].

Some early work on groupwise image registration based on the discrepancy 
between the set of images and the reference image has been performed [13], and 
a groupwise model-matching algorithm that represents image intensities as well 
as shape has also been proposed [8]. It is also possible to consider building an 
appearance model as an image coding problem [2]. The model parameters are 
iteratively re-estimated after fitting the current model to the images, leading to 
an implicit correspondence defined across the data set. 

3 Pairwise Non-rigid Registration 

In this section we describe our approach to registering a pair of images based 
on the repeated composition of local diffeomorphic warps. The extension to 
registering groups of images is detailed in Section 4. 



3.1 Overview 

We first consider the registration of two images $I_1$ and $I_2$. This requires finding
a non-linear deformation function that transforms (warps) image $I_1$ until it is as
similar as possible (as measured by some objective function) to $I_2$.
More formally, we define the following:

— Image functions $I_{1,2}$. We assume that the image functions are originally
  defined only on some dense set of points (pixels or voxels) with positions
  $X_{1,2}$, but that we can interpolate such functions to obtain $I_{1,2}(x)$ at any point
— A warp function $W(\cdot; \Phi)$ with parameters $\Phi$ that acts on sets of points
  $X \rightarrow W(X; \Phi)$
— A sampled set of values $I(X)$ from an image at a set of points $X$
— An objective function $F_{pair}(I(X), I'(X'))$ that computes the ‘similarity’
  between any two equi-sized samples
— The ‘cost’ of a deformation $G_{pair}(W)$ for deformation $W(\cdot, \Phi)$

The task of image registration can then be considered as the task of finding
parameters $\Phi$ of the warping function $W(\cdot, \Phi)$ that minimise the combined
objective function:

$$\Phi_{opt} = \arg\min_{\Phi} \left( F_{pair}(I_1(X_1), I_2(W(X_1, \Phi))) + G_{pair}(W(\cdot, \Phi)) \right) \qquad (1)$$

The form of $G_{pair}(\cdot)$ is typically chosen to penalise more convoluted
deformations, and acts as a regularisation term.
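
To make the structure of the combined objective concrete, the toy sketch below registers two 1D signals. Everything specific in it is an assumption for illustration: the warp is a pure translation, the data term is a sum of squared differences, the regulariser is a small quadratic penalty on the shift, and the minimum is found by exhaustive search over candidate shifts.

```python
import math


def interpolate(image, x):
    """Linear interpolation of a 1D image, clamped at the borders."""
    x = min(max(x, 0.0), len(image) - 1.0)
    lo = int(x)
    hi = min(lo + 1, len(image) - 1)
    frac = x - lo
    return (1.0 - frac) * image[lo] + frac * image[hi]


def f_pair(sample_a, sample_b):
    """Data term: sum of squared intensity differences (an assumed choice)."""
    return sum((a - b) ** 2 for a, b in zip(sample_a, sample_b))


def g_pair(shift, weight=0.001):
    """Regulariser: hypothetical quadratic penalty on the size of the shift."""
    return weight * shift * shift


def objective(image1, image2, points, shift):
    warped = [x + shift for x in points]  # W(X1, shift)
    return (f_pair([interpolate(image1, x) for x in points],
                   [interpolate(image2, x) for x in warped])
            + g_pair(shift))


# image2 is image1 moved 6 samples to the right.
image1 = [math.exp(-((x - 20) / 4.0) ** 2) for x in range(64)]
image2 = [math.exp(-((x - 26) / 4.0) ** 2) for x in range(64)]
points = list(range(10, 31))
candidates = [0.5 * k for k in range(-20, 21)]
best_shift = min(candidates, key=lambda s: objective(image1, image2, points, s))
```

The regulariser pulls the solution towards small deformations, so its weight has to be small relative to the data term for the true shift to be recovered.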

3.2 Diffeomorphic Warps 

We assume that each image in a set should contain the same structures, and
hence there should be a unique and invertible one-to-one correspondence between
all points on each pair of images. This suggests that the correct representation of
warps is one that will not ‘tear’ or fold the images; we therefore choose to select
warps from the diffeomorphism group. For more discussion of this point, see [20].
For any two diffeomorphisms $f(x)$, $g(x)$, their composition $(f \circ g)(x) = f(g(x))$
is also a diffeomorphism. We can thus construct a wide class of diffeomorphic
functions by repeated compositions of a basis set of simple diffeomorphisms.

In the Appendix we describe how such sets of bounded diffeomorphisms may
be constructed. Boundedness is a useful property when performing numerical
optimisation, as we can take advantage of the fact that only a subset of the
image samples (those within the area of effect of the warp) will change. Our basis
warps are parameterised by the movement of the centre point of the ellipsoid
affected, and the size, position and orientation of the ellipsoidal region. We will
denote the $i$-th such warp by $f_i = f(\cdot, \phi_i)$, where $\phi_i$ is the set of parameters for
the $i$-th warp. The total warp is then $W(\cdot, \Phi) = f_n \circ f_{n-1} \circ \ldots \circ f_2 \circ f_1 \circ A$, where
$A(\cdot, \phi_A)$ is an affine transformation with parameters $\phi_A$, and the parameters of
the total warp are $\Phi = \{\phi_A, \phi_1, \ldots, \phi_n\}$.
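
The idea of building a complex warp by composing simple bounded ones can be sketched in 1D. The basis warp below is a stand-in chosen for illustration and is not the construction given in the Appendix: it moves the centre of an interval by `shift`, is the identity outside the interval, and is continuously differentiable and strictly increasing provided the shift is small enough relative to the radius.

```python
import math

# The polynomial bump (1 - t^2)^2 has maximum slope 8 / (3 * sqrt(3)), so the
# warp stays strictly increasing while |shift| / radius is below this ratio.
MAX_RATIO = 3.0 * math.sqrt(3.0) / 8.0


def local_warp(x, centre, radius, shift):
    """Bounded 1D warp: identity outside [centre - radius, centre + radius]."""
    if abs(shift) >= MAX_RATIO * radius:
        raise ValueError("shift too large for the warp to stay invertible")
    t = abs(x - centre) / radius
    if t >= 1.0:
        return x
    return x + shift * (1.0 - t * t) ** 2


def compose(warps):
    """Total warp f_n o ... o f_1; the first entry of `warps` is applied first."""
    def total(x):
        for centre, radius, shift in warps:
            x = local_warp(x, centre, radius, shift)
        return x
    return total


# Three hypothetical overlapping local warps as (centre, radius, shift).
total_warp = compose([(10.0, 5.0, 2.0), (12.0, 4.0, -1.5), (11.0, 3.0, 1.0)])
grid = [0.05 * k for k in range(401)]
warped = [total_warp(x) for x in grid]
```

Because each basis warp is strictly increasing, so is their composition, which is the 1D analogue of the total warp neither tearing nor folding the image.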

3.3 Optimisation Regime 

The representation of complex warps requires many parameters, so that it is not 
feasible to optimise over all the parameters at once. We therefore adopt a sequen- 
tial strategy in which we start with relatively simple warps and incrementally 
compose and optimise additional warps. In practice, we do the following: 

— Optimise the affine registration parameters $\phi_A$

— Construct the affinely-warped points $X^{(0)} = A(X_1, \phi_A)$, the zeroth-order
estimate of the warp and the associated warped points

— For each non-linear diffeomorphism $f_i = f(\cdot, \phi_i)$, $i = 1, \dots, n$:

• For each given set of parameters $\phi_i$, apply the local warp $f(\cdot, \phi_i)$ to
the current estimate of the warped points, $X^{(i-1)}$, and sample from $I_2$
at these new points. This gives estimates of the fully-optimised warp
$W(X_1, \Phi_{opt})$ and the true optimal sample $I_2(W(X_1, \Phi_{opt}))$

• Find the particular parameters $\phi_i$ that minimise the objective function
given in Eq. (1), recalculating the estimate of the warp at each stage

• Update the estimate of the full warp, $X^{(i)} = f(X^{(i-1)}, \phi_i)$

— Output $X^{(n)}$, the estimate of the true global optimum warp $W(X_1, \Phi_{opt})$
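
The sequential regime listed above can be sketched in one dimension. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the identity stands in for the affine stage, a coarse line search over the displacement replaces downhill simplex or gradient descent, and the names `local_warp_1d`, `objective` and `register` are hypothetical. Each local warp uses the polynomial profile from the Appendix and keeps the displacement below the diffeomorphic bound stated there.

```python
import numpy as np


def local_warp_1d(x, centre, radius, shift):
    """Bounded warp: moves `centre` by `shift`, identity where |x - centre| >= radius."""
    u = (x - centre) / radius
    weight = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
    return x + weight * shift


def objective(points, reference, target, grid):
    """Sum of absolute differences between the reference and the target sampled at `points`."""
    return float(np.abs(np.interp(points, grid, target) - reference).sum())


def register(reference, target, grid, n_warps=50, seed=0):
    rng = np.random.default_rng(seed)
    points = grid.copy()  # zeroth-order estimate; the identity stands in for the affine stage
    best = objective(points, reference, target, grid)
    history = [best]
    extent = grid[-1] - grid[0]
    for _ in range(n_warps):
        # Randomly placed and sized region of effect for this local warp.
        centre = rng.uniform(grid[0], grid[-1])
        radius = rng.uniform(0.05, 0.3) * extent
        best_points = points
        # Line search over the displacement, kept below 3*sqrt(3)/8 * radius.
        for shift in np.linspace(-0.6 * radius, 0.6 * radius, 21):
            candidate = local_warp_1d(points, centre, radius, shift)
            value = objective(candidate, reference, target, grid)
            if value < best:
                best, best_points = value, candidate
        points = best_points  # compose the accepted warp into the running estimate
        history.append(best)
    return points, history


# Toy data (assumed): the target is the reference bump shifted by 0.05.
grid = np.linspace(0.0, 1.0, 200)
reference = np.exp(-((grid - 0.50) / 0.05) ** 2)
target = np.exp(-((grid - 0.55) / 0.05) ** 2)
points, history = register(reference, target, grid)
```

Because a candidate warp is only accepted when it lowers the objective, `history` never increases, and because each accepted warp is an increasing map, the warped points keep their order.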

In our implementation we have used downhill simplex in the early stages 
and simple gradient descent in the later stages of the optimisation, although any 
non-linear optimiser could be used. We use a multi-resolution approach to give 
better robustness. The search regime is then defined by the positioning of the 
effective regions of the local warps. In the experiments presented here the regions 
to be warped were disks of randomly chosen position and radii. This approach is 
similar to that described in [9], but their local warps were not smooth at the edges 
of the regions of effect, nor did their construction guarantee diffeomorphisms or 
even differentiability, so that their total warp was not necessarily smooth or 
invertible.

4 Groupwise Non-rigid Registration

4.1 Groupwise Objective Functions 

Suppose that instead of two images we now have a set of $N$ images, $I_i$. We wish to
register these images into a common coordinate frame. Following Davies et al. [6] 
we treat the problem of global registration as one of optimising a groupwise ob- 
jective function that essentially measures the compactness of a statistical model 
built from correspondences resulting from the registration process. For registra- 
tion, a suitable function for optimisation combines both shape (position) and 
intensity components. 

The correspondence between a set of images explicitly defines those structures 
that should be treated as analogous. The contention is that statistical modelling 
of variations between analogous parts of structures should, in some sense, be 
‘simpler’ than modelling variations between non-analogous parts. This idea of 
‘simplicity’, or of appropriateness of the model, is expressed in the Minimum 
Description Length (MDL) framework [17] in terms of the length of an encoded 
message; this message transmits the whole set of examples, encoded by using 
the statistical model defined by the correspondence. Inappropriate choice of cor- 
respondence then leads to a non-optimal encoding of the data, and a greater 
length of the message. Note that this definition of optimal correspondence is 
explicitly concerned with the whole set of images, rather than correspondences 
defined between pairs of examples. We show below how we are able to construct 
an information-theoretic objective function for groupwise non-rigid registration. 

Let $X_i$ be the points on the $i$th image obtained by applying the current
estimate of the set of warp parameters $\Phi_i$ for the warp between example $I_i$ and the
reference image $I_1$. That is, $X_i = W(X_1; \Phi_i)$. Suppose also that $s_i = I_i(X_i)$ is
the vector of image intensities sampled at those points. We then seek to find the
set of full warp parameters $\{\Phi_i\}$ that minimise some objective function

$$G_N\big((s_1, X_1), \dots, (s_i, X_i), \dots, (s_N, X_N)\big), \tag{2}$$

which is chosen to measure the appropriateness of the groupwise correspondence.
This is potentially a challenging problem, given the dimensionality of both the data set
and the parametrisation of the warps.

In the work described here we will treat the shape and texture independently
(although, ideally, correlations between shape and texture should also be
considered). That is, we will use a function of the simplified form:

$$G_N\big((s_1, X_1), \dots, (s_N, X_N)\big) = F_N(s_1, \dots, s_N) + G_N(X_1, \dots, X_N). \tag{3}$$

One information-theory based approach to the construction of such an objective
function, which is suitable for problems where sequential optimisation is
the only feasible optimisation strategy, is that of Minimum Message Length [17].
Each image example is left out in turn, and a model is built using the other $N - 1$
examples and their current correspondences. The length of the message required
to transmit the left-out example using this model is then calculated. We can
then optimise this message length by manipulating the correspondence for the
missing example.

To be specific, let $P_X^{(i)}(X)$ be an estimate of the model probability density
function computed from the vectors of all corresponding points leaving out
example $i$; that is, all the sets of points $X_j$, $j \neq i$. Similarly, let $P_s^{(i)}(s)$ be an
estimate of the model density function computed from all the texture sample
vectors $s_j = I_j(X_j)$, $j \neq i$.

The estimated message length for transmitting example $i$ with the correspondence
defined by $X_i = W(X_1, \Phi_i)$ then leads to the objective function:

$$C_i(\Phi_i) = -\log P_X^{(i)}\big(W(X_1, \Phi_i)\big) - \lambda \log P_s^{(i)}\big(I_i(W(X_1, \Phi_i))\big), \tag{4}$$

where $\lambda$ represents the relative weighting given to the shape and texture parts.
By manipulating the warp parameters $\Phi_i$ we manipulate the correspondence
for image $I_i$ relative to the rest of the examples, and can hence optimise this
correspondence for this example by minimising the value of $C_i$. This single-example
warp optimisation is performed in an analogous fashion to the pairwise
example given previously. We next describe the full groupwise optimisation strategy.

4.2 Groupwise Optimisation Algorithm 

In what follows, we will use the term ‘correspondence’ to denote the correspondence
between images induced by a warp $W(\cdot, \Phi)$; manipulating the set of
warp parameters $\Phi$ manipulates the correspondence. Since our warps are diffeomorphic
by construction, all correspondences so defined are one-to-one and
invertible. The groupwise optimisation algorithm is:

— Initialisation: perform pairwise non-rigid registrations between each image
and $I_1$, giving initial estimate $X_i^{(0)}$ for each.

— REPEAT

• For each $i = 2, \dots, N$

* (Re)compute the model p.d.f.s $P_X^{(i)}(X)$ and $P_s^{(i)}(s)$, leaving out
example $i$ from the model building process

* Find the optimal set of warp parameters $\Phi_i$ that minimise $C_i(\Phi_i)$

* Update estimate of correspondence for example $i$ using these optimal
warp parameters, $X_i \to W(X_1, \Phi_i)$

— UNTIL CONVERGENCE



5 Results of Experiments 

We have applied the groupwise model building approach to examples of faces
and 2D MR brain images. For the intensity part our chosen objective function
is a sum-of-absolute-differences (implying an exponential PDF), a more robust
statistic than sum-of-squares. In the examples given we have similar ranges of
intensity across the set, so do not need to further normalise the intensity ranges.
The shape component of the objective function is a sum of squares of the second
derivatives of the deformation field (evaluated at each warped grid point $X_i$); this
discourages excessive bending. In the experiments below we heuristically choose
the factor $\lambda$ that weights the shape objective function relative to the intensity
measure. We are currently examining ways of automatically selecting suitable
values.

We initialise $X_1$ to a grid covering the reference image, and as before let
$s_i$ be the result of sampling image $i$ at the current warped points. Let $dX_i$ be
the vector concatenation of all the second derivatives evaluated at each of the
warped grid points. The pairwise objective function we use is

$$F_{pair} = \sum_k |s_{ik} - s_{1k}| + \lambda_{pair} |dX_i|^2. \tag{5}$$

During the groupwise stage, when optimising on image $i$, we use

$$F_{group} = \sum_k \frac{|s_{ik} - \bar{s}_k|}{w_k} + \lambda_{group}\, dX_i^T W^{-1} dX_i, \tag{6}$$

where $\bar{s}_k$ is the mean of the $k$th sample across the other members of the set,
$w_k$ is the mean absolute difference from the mean, and $W$ is a diagonal matrix
describing the variances of the elements of $dX_j$ for $j \neq i$. This simple Gaussian
model of the distribution is a natural groupwise extension of the commonly used
bending energy term. It allows more freedom to deform in areas in which other
images exhibit larger deformations.
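
The two objectives described above (a sum of absolute differences plus a bending term, with leave-one-out normalisation in the groupwise case) can be sketched as follows. This is an illustrative sketch only: the function names, the array layout (one row per image) and the small `eps` guard against division by zero are assumptions, not part of the original method.

```python
import numpy as np


def f_pair(s_i, s_ref, dX_i, lam_pair):
    """Pairwise objective: sum of absolute differences plus weighted squared bending."""
    return float(np.abs(s_i - s_ref).sum() + lam_pair * np.dot(dX_i, dX_i))


def leave_one_out_stats(S, dX, i):
    """Statistics over all examples except i.

    S:  (N, K) texture samples, one row per image.
    dX: (N, M) second-derivative vectors, one row per image.
    """
    others = np.delete(np.arange(S.shape[0]), i)
    s_bar = S[others].mean(axis=0)                   # mean sample
    w = np.abs(S[others] - s_bar).mean(axis=0)       # mean absolute difference from the mean
    var = dX[others].var(axis=0)                     # diagonal of the variance matrix
    return s_bar, w, var


def f_group(s_i, dX_i, s_bar, w, var, lam_group, eps=1e-8):
    """Groupwise objective for example i; eps is an assumed numerical guard."""
    texture = (np.abs(s_i - s_bar) / (w + eps)).sum()
    shape = (dX_i**2 / (var + eps)).sum()            # dX^T W^{-1} dX for diagonal W
    return float(texture + lam_group * shape)
```

In a groupwise pass, `leave_one_out_stats` would be recomputed for each example before its warp parameters are optimised, so that the example being optimised never contributes to its own model.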



5.1 Corresponding MR Brain Slices 

We applied the method to 16 MR brain slices, each from an image of a dif- 
ferent person, with approximately corresponding axial slices being chosen. The 
optimisation regime for the groupwise algorithm first requires finding the best 
affine transformation, before composing 1500 randomly sized and centred warps 
during the pairwise stage and a further 3000 randomly sized and centred warps 
during the groupwise stage. The algorithm is implemented in C++¹ and the
optimisation took about 15 minutes on a 2.8GHz PC.

Figure 1 shows the resulting deformation of one of the brains. We took a
hand annotation of the reference image and used the acquired warps to propagate
this to the other images. We then constructed a linear statistical shape
and appearance model [4] from the resulting annotations. Figure 2 shows the
two largest modes of shape deformation, while Figure 3 shows the two largest
modes of combined shape and texture variation. Note that the shape model is
built from the points of the projected annotation only, not on the dense grid of
points used in the correspondence process. This allows us to use a sparser representation
of the key features only, potentially leading to more compact models.
Such models give a compact summary of the variation in the set, and can
be used to match to further images using rapid optimisation algorithms such as
the Active Appearance Model [4]. Note that the linear model does not enforce
diffeomorphisms.

¹ Using the VXL computer vision library: www.sourceforge.org/projects/vxl






Fig. 1. Example MR slices before and after groupwise registration. Panels: reference; brain warped to reference; warp field; before registration; after registration.




Fig. 2. Two largest modes of shape variation of a model built from 2D brain slices. Panels: Shape Mode 1 (±2 s.d. from mean); Shape Mode 2 (±2 s.d. from mean).

Fig. 3. Two largest modes of appearance variation (model built from 2D brain slices). Panels: Mode 1 (-2 s.d., mean, +2 s.d.); Mode 2 (±2 s.d. from mean).



5.2 Corresponding Face Images 

We took 51 face images, each of a different person, from the XM2VTS face
database [16]². We applied the groupwise registration to find correspondences,
which took about 30 minutes on a 2.8GHz PC. As before, we propagated an
annotation of the reference image to the rest of the set and constructed a linear
model of appearance. Figure 4 shows the two largest modes of shape deformation,
while Figure 5 shows the two largest modes of combined shape and texture
variation. The crispness of the resulting appearance model demonstrates that an
accurate correspondence has been achieved.




Fig. 4. Two largest modes of shape variation of a model built from 51 face images. Panels: Shape Mode 1 (±2 s.d. from mean); Shape Mode 2 (±2 s.d. from mean).

Fig. 5. Two largest modes of appearance variation of a model built from 51 face images. Panels: Mode 1 (-2 s.d., mean, +2 s.d.); Mode 2 (±2 s.d. from mean).

² We selected the first 51 people without glasses or facial hair. Such features, which
appear or disappear from one image to another, break the assumptions of diffeomorphic
correspondence in the process.

Table 1. Point-curve errors after registration of 51 face images. (The faces are approximately
100 pixels wide.)

Errors (pixels)           Mean   SD    Max
After initial pairwise    2.0    1.5   11.4
After full groupwise      2.0    1.0    7.1
Pairwise (same regime)    2.1    1.6   11.4

In order to evaluate the performance of the system, we compared the point
positions obtained by transferring landmarks with those from a manual annotation
of the 51 images. We measured the mean absolute difference between the
found points and the equivalent curve on the manual annotation (see Figure 4).
The results are summarised in Table 1. After the initial pairwise stage of the 
search (1100 warps) we obtain a mean accuracy of 2.0 pixels with an s.d. of 
1.5 pixels. Completing the groupwise phase (a further 2000 warps) does not im- 
prove the mean but tightens up the distribution considerably, reducing both the 
variance and the maximum error. For comparison we ran a purely pairwise reg- 
istration with the same number and distribution of additional random warps - 
the additional warps make little difference to the original pairwise result. 

6 Discussion 

We have presented a framework for establishing dense correspondences across 
groups of images using diffeomorphic functions and have demonstrated its appli- 
cation to two different domains. We have shown that in the case of the faces the 
groupwise method produces a more reliable registration than a purely pairwise 
approach. 

We have described one example of objective functions, warping functions 
and optimisation regime, which appear to give good results. There is consider- 
able research to be done investigating alternatives for each component of the 
framework. For instance, the groupwise function used above assumed diagonal 
covariance and may be improved with a full covariance matrix. Alternatively a 
statistical model of position, rather than derivatives, may lead to better results. 

Similarly, the relative weighting between shape and intensity terms, $\lambda$, is
somewhat arbitrary. For the pairwise case it is hard to select by anything other
than trial and error. However, in the groupwise case, if the terms in the functions
are related to log probabilities, it is possible to select $\lambda_{group}$ more systematically -
in the experiments we used a value of ^ (there are 3 terms in the derivative 
vector for each element in the texture vector, and each term is normalised by its 
standard deviation). 

The methods described above extend directly into three (and higher) dimensions.
The diffeomorphic warps of disks become warps of spheres (see Appendix A).
We have used the techniques to register 3D MR images of the brain,
and are currently evaluating the performance of the algorithms.

The general framework gives a powerful technique for registering images and
for unsupervised shape and appearance model building. We anticipate it will
have applications in many domains of computer vision.

Appendix A: Bounded Diffeomorphisms

A useful class of bounded diffeomorphisms in arbitrary dimensions can be constructed
using the following equation, which warps space only within the unit
ball, based on the displacement of the centre by $a$:

$$f(x; a) = \begin{cases} x + g(|x|)\,a & (|x| < 1) \\ x & \text{otherwise,} \end{cases} \tag{7}$$

where $a$ is the position to which the origin is warped ($|a| < 1$) and $g(r)$ is a
smooth function satisfying the following properties: $g(0) = 1$, $g(1) = 0$, $g'(0) = 0$,
$g'(1) = 0$. $f(x; a)$ is diffeomorphic providing that $|a| < 1/d_{max}$, where
$d_{max} = \max_{0<r<1} |g'(r)|$. This function is bounded, so that it deforms space only within
the unit disc (2D) or unit sphere (3D). In the 2D case, if $g(r) = 1 - r^2 + r^2 \log(r^2)$,
then this is a Clamped Plate Spline with a single control point at the origin [20].
This function is guaranteed diffeomorphic provided that $|a| < 0.25e$, providing
a family of bounded diffeomorphisms parameterised by the point to which the
origin is warped ($a$).

The simplest polynomial form for $g(r)$ is $g(r) = (1 - r^2)^2$, which leads to
an efficient implementation of the function in arbitrary dimensions. In this case
we require $|a| < 3\sqrt{3}/8 \approx 0.650$ for a diffeomorphism. By combining with a
suitable affine transformation we can generate diffeomorphisms that only affect
a particular ellipsoidal region of space.
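
The polynomial form can be written in a few lines. The sketch below is an illustration rather than the authors' code: `bounded_warp` and `local_warp` are hypothetical names, and `local_warp` covers only the isotropic special case (a ball of given centre and radius) of combining the warp with an affine transformation.

```python
import numpy as np

# Reciprocal of max |g'(r)| on [0, 1] for g(r) = (1 - r^2)^2, i.e. 3*sqrt(3)/8.
A_MAX = 3.0 * np.sqrt(3.0) / 8.0


def bounded_warp(x, a):
    """Apply f(x; a) = x + g(|x|) a inside the unit ball, identity outside.

    x: (..., d) array of points; a: (d,) displacement of the origin.
    """
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    if np.linalg.norm(a) >= A_MAX:
        raise ValueError("displacement too large for a diffeomorphism")
    r = np.linalg.norm(x, axis=-1)
    weight = np.where(r < 1.0, (1.0 - r**2) ** 2, 0.0)
    return x + weight[..., None] * a


def local_warp(x, centre, radius, a):
    """Same warp restricted to a ball of given centre and radius (isotropic case only)."""
    centre = np.asarray(centre, dtype=float)
    u = (np.asarray(x, dtype=float) - centre) / radius
    return centre + radius * bounded_warp(u, np.asarray(a, dtype=float) / radius)


pts = np.array([[0.0, 0.0], [0.5, 0.0], [2.0, 0.0]])
warped = bounded_warp(pts, [0.3, 0.0])
```

Here the origin moves to `a`, the point at radius 0.5 moves by `g(0.5) * a`, and the point outside the unit ball is left untouched. Rescaling the displacement by the radius in `local_warp` keeps the same diffeomorphic bound relative to the size of the region.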
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Abstract. In this paper we present an approach for separating two 
transparent layers in images and video sequences. Given two initial unknown
physical mixtures, $I_1$ and $I_2$, of real scene layers, $L_1$ and $L_2$, we
seek a layer separation which minimizes the structural correlations across
the two layers, at every image point. Such a separation is achieved by 
transferring local grayscale structure from one image to the other wher- 
ever it is highly correlated with the underlying local grayscale structure 
in the other image, and vice versa. This bi-directional transfer operation, 
which we call the “layer information exchange” , is performed on dimin- 
ishing window sizes, from global image windows (i.e., the entire image), 
down to local image windows, thus detecting similar grayscale structures 
at varying scales across pixels. We show the applicability of this approach 
to various real-world scenarios, including image and video transparency 
separation. In particular, we show that this approach can be used for 
separating transparent layers in images obtained under different polar- 
izations, as well as for separating complex non-rigid transparent motions 
in video sequences. These can be done without prior knowledge of the 
layer mixing model (simple additive, alpha-matted composition with an
unknown alpha-map, or other), and under unknown complex temporal 
changes (e.g., unknown varying lighting conditions). 



1 Introduction 

The need to perform separation of visual scenes into their constituent layers 
arises in various real world applications (medical imaging, robot navigation, and 
others). This problem is challenging when the layers are transparent, thus gener- 
ating complex superpositions of visual information. The problem is particularly 
challenging when the mixing process is an unknown, spatially varying, non-linear 
function, as is often the case in real-world transparent scenes. 

A number of approaches to transparent layer separation have been proposed.
Most of the approaches for separation of still images assume additive transparency
with layer mixing functions which are uniform across the entire image
(e.g., [8,5,7]). Spatially varying functions were handled by [3] assuming sparseness
of image derivatives. In the case of video transparency (where the transparent
layers have different relative motions over time), the underlying assumption is
that dense correspondences can be pre-computed for each pixel in each layer
across the entire sequence [9,12]. These methods are therefore restricted to scenes
with simple 2D parametric motions, which are easy to compute under transparency
and provide dense correspondences. Non-parametric correspondences
are handled in [10] assuming stereo images. None of the above methods can handle
complex non-rigid motions. Szeliski et al. [9,10] further assume fixed mixing
coefficients.

In this paper we address the problem of separation of two arbitrarily su- 
perimposed layers (either in images, or in video), without any prior knowledge 
about the mixing process. We assume that two different combinations of the lay- 
ers (generated in an unknown fashion) are given to us, and use these to initiate 
the layer separation process. As will be shown later, two different combinations 
of layers are often available or otherwise easy to obtain in many real-world sce- 
narios, making this approach practical. 

Formally, and without loss of generality, we can phrase the problem as follows.
Given two initial unknown physical mixtures, $I_1$ and $I_2$, of real scene layers, $L_1$
and $L_2$, produce approximations $\hat{L}_1$ and $\hat{L}_2$ such that some separation criterion
is satisfied. The two mixtures $I_1$ and $I_2$ can be generally defined as

$$\begin{aligned} I_1(i) &= \alpha_1(i) \cdot L_1(i) + \alpha_2(i) \cdot L_2(i) \\ I_2(i) &= \beta_1(i) \cdot L_1(i) + \beta_2(i) \cdot L_2(i) \end{aligned} \tag{1}$$

where the index $i$ denotes pixel position, and $\alpha_1(i)$, $\alpha_2(i)$, $\beta_1(i)$, and $\beta_2(i)$ are
the unknown mixing functions (coefficients) which vary over pixel locations. In
the simplest case, when the mixing is uniform and additive (as assumed in [9,
5,8,7]), the mixing functions reduce to constant coefficients: $\forall i$, $\alpha_1(i) = \alpha_1$,
$\alpha_2(i) = \alpha_2$, $\beta_1(i) = \beta_1$ and $\beta_2(i) = \beta_2$. In natural scenes, however, such conditions
are frequently violated. Smoothly varying glass opacity, window dirt, or
images acquired through polarization filters can produce mixing coefficients
that vary over pixel locations. The formulation of Eq. (1) is general and
captures a wide range of transparency models, including additive transparency
with uniform mixing functions [9, 5, 8, 7], additive transparency with unknown
alpha-matting (e.g., [12]), etc.
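
As a concrete reading of Eq. (1), the sketch below forms two mixtures from two layers, once with constant coefficients and once with a coefficient that varies over pixel positions. The function name `mix_layers`, the random test layers and the particular coefficient values are assumptions made for illustration only.

```python
import numpy as np


def mix_layers(L1, L2, alpha1, alpha2, beta1, beta2):
    """Form the two mixtures of Eq. (1); each coefficient may be a scalar or a per-pixel array."""
    I1 = alpha1 * L1 + alpha2 * L2
    I2 = beta1 * L1 + beta2 * L2
    return I1, I2


rng = np.random.default_rng(0)
L1 = rng.random((32, 48))
L2 = rng.random((32, 48))

# Uniform additive mixing: constant coefficients.
I1_uniform, I2_uniform = mix_layers(L1, L2, 1.0, 0.667, 0.3, 1.0)

# Spatially varying mixing: alpha2 ramps from 0.2 to 0.8 across the image width.
ramp = np.linspace(0.2, 0.8, L1.shape[1])[None, :]
I1_varying, I2_varying = mix_layers(L1, L2, 1.0, ramp, 0.3, 1.0)
```

NumPy broadcasting lets the same function cover both the uniform case and the position-dependent case, which mirrors how the constant-coefficient model is a special case of the general formulation.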

Having two initial combinations, $I_1$ and $I_2$, generated in an unknown fashion,
we seek a layer separation into representations of $L_1$ and $L_2$ which minimizes
the structural correlations across the two layers at every image point. Such a
separation is achieved by transferring local structure from one image to the other
wherever it is highly correlated with the underlying local structure in the other
image, and vice versa. This bi-directional transfer operation, which we call the
“layer information exchange”, is performed on diminishing window sizes, from
global image windows (i.e., the entire image) down to local image windows, thus
detecting correlated structures at varying scales across pixel positions.

Two different initial combinations ($I_1$ and $I_2$) are available, e.g., when two
images of the same transparent scene are taken with different polarizers (as
in [5,8]), or under different illuminations. However, our approach is not limited
to those cases nor is it restricted to still imagery. When a single video camera
records two transparent layers with different relative motions over time, and
when the motion of only one of those layers is computable (e.g., a 2D parametric
motion), then such initial layer separation is possible. This can be done even if
the second layer contains very complex non-rigid motions (e.g., running water).
Moreover, the layer mixing process is not known and can possibly change over
time, and other unknown complex temporal changes may also occur simultaneously
(such as varying illumination and changing light reflections over time).
Such examples are shown and discussed in the paper.

This paper has three main contributions: (i) The idea of “layer information
exchange”. (We also believe that this idea has applicability in disciplines of signal
processing other than Computer Vision.) (ii) To the best of our knowledge, this is the
first time that video sequences containing non-rigid transparent motions have
been separated (moreover, under unknown complex varying lighting conditions).
(iii) Our approach provides a unified treatment to a wide range of transparency
models, without requiring prior selection of the transparency model and the
corresponding separation method. When the unknown mixing coefficients are
spatially-invariant (i.e., only grayscale dependent, but independent of the pixel
position), then our approach produces comparable results to Farid and Adelson’s
ICA-based separation [5]. However, when the mixing coefficients are spatially-varying
(unknown) functions, our approach performs better. Similarly, if the
motions of both transparent layers in a video sequence are easy to compute, then
our approach compares to existing methods for separating video transparency [9,
12]. However, it performs better when one of the layers contains complex motions
(such as non-rigid motions, 3D parallax) and other complex temporal changes.

The rest of the paper is organized as follows. In Section 2 we identify an 
information correlation measure which is best suited for the underlying problem. 
In Section 3 we introduce our layer information exchange process, which is used 
for recovering the separate layers. In Section 4 we show the applicability of the 
method to transparency separation in still images and in video sequences. 



2 The Information Correlation Measure 

There are various commonly used measures for correlating information across 
images. In this section we review some of their advantages and drawbacks, and 
identify a measure which is best suited for the task at hand. 

The Mutual Information ($MI$) of two images ($f$ and $g$) captures the statistical
correlation (or co-occurrence) of their grayscales: $MI(f, g) = H(f) + H(g) - H(f, g)$,
where $H(f)$ is the entropy of the grayscale distribution in $f$, and $H(f, g)$
is the joint entropy [4]. Mutual Information can account for non-linear grayscale
transformations which are spatially invariant (i.e., transformations which depend
only on the grayscale value at a pixel, but not on the pixel position). However,
it cannot account for spatially varying grayscale transformations which are pixel
position dependent (such as the spatially varying mixing functions of Eq. (1)).
In other words, if $\tilde{f}$ is an image obtained from $f$ by some (non-linear) transformation
on the histogram of $f$, then $MI(f, \tilde{f}) = MI(f, f)$ (see Fig. 1.b and 1.c).
However, if $\tilde{f}$ is obtained from $f$ by some spatially varying (position-dependent)
grayscale transformation, then the mutual information of $f$ and $\tilde{f}$ reduces significantly:
$MI(f, \tilde{f}) \ll MI(f, f)$, even though the geometric structures observed
in $f$ and in $\tilde{f}$ are highly correlated (see Fig. 1.d).

Panels: (a) Original image $f_a$; (b) Linear grayscale deformation $f_b$; (c) Non-linear grayscale deformation $f_c$; (d) Spatially varying (position dependent) deformation $f_d(x, y)$ (see caption).

(e)
Measure (normalized)   f_a vs. f_a   f_a vs. f_b   f_a vs. f_c   f_a vs. f_d
MI                     1.0000        1.0000        1.0000        0.3426
NGC                    1.0000        1.0000        0.8329        0.8700
GNGC                   1.0000        1.0000        0.9165        0.9981

Fig. 1. Comparing different information correlation measures. (a) Original
image. (b) After a linear grayscale transformation. (c) After a nonlinear grayscale
transformation. (d) After a spatially varying (i.e., position-dependent) grayscale
transformation: $f_d(x, y) = f_a \cdot (\sin(\ldots)\,\sin(\ldots) \cdot 0.333 + 0.667)$, where $n_x \times n_y$ is the image
size. (e) Comparing the information correlation between the original image $f_a$ and the
transformed images ($f_b$, $f_c$, $f_d$) under different measures (NGC, MI, GNGC; see Section 2).
As can be seen, GNGC correlates extremely well across all transformations.
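
The behaviour of MI described above can be checked with a small histogram-based estimate. This is a minimal sketch under assumptions: the images are 8-bit-valued, one histogram bin is used per gray level, and the names `entropy` and `mutual_information`, the random test image and the linear gain ramp are all chosen here for illustration.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (bits) of a probability table, ignoring empty cells."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def mutual_information(f, g, levels=256):
    """MI(f, g) = H(f) + H(g) - H(f, g) from the joint grayscale histogram."""
    joint, _, _ = np.histogram2d(
        np.ravel(f), np.ravel(g), bins=levels, range=[[0, levels], [0, levels]]
    )
    joint = joint / joint.sum()
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())


rng = np.random.default_rng(0)
f = rng.integers(0, 256, size=(64, 64))
inverted = 255 - f                                  # spatially invariant grayscale map
gain = np.linspace(0.5, 1.0, f.shape[1])[None, :]   # position-dependent gain
shaded = np.floor(f * gain)

mi_self = mutual_information(f, f)
mi_inverted = mutual_information(f, inverted)
mi_shaded = mutual_information(f, shaded)
```

A one-to-one grayscale map such as inversion leaves the joint histogram's probabilities unchanged, so `mi_inverted` matches `mi_self`, while the position-dependent gain spreads each gray level over many output values and lowers `mi_shaded`.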

A different widely used information correlation measure is the Normalized
Gray-scale Correlation ($NGC$): $NGC(f, g) = \frac{C(f, g)}{\sqrt{V(f) \cdot V(g)}}$, where
$C(f, g) = \frac{1}{N} \sum_j f_j \cdot g_j - \bar{f} \cdot \bar{g}$ is the covariance of $f$ and $g$, $N$ is the number of pixels
in $f$ ($f$ and $g$ are of the same size), $\bar{f}$, $\bar{g}$ are the average grayscale values
of $f$, $g$, and $V(f) = \frac{1}{N} \sum_j f_j^2 - \bar{f}^2$ is the variance of $f$. NGC can account
only for linear grayscale transformations which are spatially invariant (i.e., only
changes in the mean and variance of the intensity; see Fig. 1.b). Intuitively
speaking, the normalized correlation (captured by NGC) can be regarded as a
linear approximation of statistical correlation (captured by MI).

The above two measures require global grayscale correlations (whether normalized
or statistical). We next define an information correlation measure which
requires only local correlations, and can therefore account for a wide variety
of grayscale variation (linear and non-linear), including spatially-varying (i.e.,
position-dependent) grayscale transformations. This measure, which we will refer
to as the Generalized NGC (GNGC) measure, is a weighted average of local
NGC measures on small (typically 5 × 5) windows:

$$GNGC(f, g) = \frac{\sum_{i=1}^{N} NGC_i^2(f, g) \cdot V_i(f) \cdot V_i(g)}{\sum_{i=1}^{N} V_i(f) \cdot V_i(g)} = \frac{\sum_{i=1}^{N} C_i^2(f, g)}{\sum_{i=1}^{N} V_i(f) \cdot V_i(g)}, \tag{2}$$

where $C_i(f, g)$ and $NGC_i(f, g)$ are, respectively, the local covariance and the
local normalized correlation measure between two small corresponding windows
(5 × 5) centered at pixels $i$ in images $f$ and $g$. In principle, one could define a
similar global measure to that of Eq. (2) using a weighted sum of local MI measures
(instead of local NGC measures). However, there are not enough grayscale
statistics in small 5 × 5 windows, which is why we resort to the local NGC
measures. In case of color images, the sum is taken over all three color bands.

The normalized weighted sum in Eq. (2) takes into account the correlations
of small corresponding windows across $f$ and $g$. These are weighted according to
their reliability, which is measured by the grayscale variances in the local (5 × 5)
windows. This captures correlations of small geometric features (under different
grayscale transformations) without introducing numerical instabilities which are
common to regular normalized correlation in small windows. Prominent geometrical
features in the image are characterized by large local grayscale variances
and therefore contribute more to the global correlation (GNGC) measure, while
flat grayscale regions have small local grayscale variances, hence small weights.

Unlike the MI measure, the GNGC measure (Eq. (2)) also captures the
statistical correlations between geometric structures in the image. It can therefore
account for spatially varying non-linear grayscale transformations, such as
the one shown in Fig. 1.d, whereas MI cannot. The reason for this difference
between the two measures is that MI requires global statistical correlation of
grayscales across the two images (a condition which is violated under spatially-varying
grayscale transformations), whereas GNGC requires only local statistical
correlation across the two images (but at every 5 × 5 window in the image). Similar
measures to the GNGC measure have been previously used for other tasks
where correlation between geometric structures was needed (e.g., for multi-sensor
alignment [6]), although in the past a regular integration of local correlation values
for those tasks was typically used, whereas our global measure is a weighted
sum of the local measures. This modification is crucial to the stability of the
layer separation process.
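
The global and local measures discussed in this section can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, only windows that lie fully inside the image are used (an assumption about border handling), and single-band images are assumed.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def ngc(f, g):
    """Global normalized grayscale correlation: covariance over the root of the variances."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    cov = (f * g).mean() - f.mean() * g.mean()
    return float(cov / np.sqrt(f.var() * g.var()))


def local_moments(f, g, win=5):
    """Covariance and variances in every win x win window (windows fully inside the image)."""
    wf = sliding_window_view(np.asarray(f, dtype=float), (win, win))
    wg = sliding_window_view(np.asarray(g, dtype=float), (win, win))
    mf, mg = wf.mean(axis=(-2, -1)), wg.mean(axis=(-2, -1))
    cov = (wf * wg).mean(axis=(-2, -1)) - mf * mg
    var_f = (wf * wf).mean(axis=(-2, -1)) - mf**2
    var_g = (wg * wg).mean(axis=(-2, -1)) - mg**2
    return cov, var_f, var_g


def gngc(f, g, win=5):
    """Sum of squared local covariances over the sum of products of local variances."""
    cov, var_f, var_g = local_moments(f, g, win)
    return float((cov**2).sum() / (var_f * var_g).sum())


rng = np.random.default_rng(0)
f = rng.normal(size=(40, 40))
h = rng.normal(size=(40, 40))   # an unrelated image
```

With these definitions a linear grayscale change such as `2 * f + 100` leaves both `ngc` and `gngc` at 1, while an unrelated image gives a `gngc` close to zero; windows with larger variance automatically carry more weight in the sum.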

Because GNGC captures correlations of meaningful geometrical structures,
it is better suited to the problem at hand. Moreover, the GNGC measure
is easy to differentiate in order to derive an analytic solution to the layer
separation problem, as will be shown in Section 3.

Fig. 2. The Layer Information Exchange. (a)-(b) The initial mixtures $I_1$ and $I_2$.
(c) Different values of $\sigma$ produce different degrees of information correlation between
images $I_2$ and $I_1 - \sigma I_2$. (d)-(h) Examples of $I_1 - \sigma I_2$ for various values of $\sigma$. “Fountain”
decreases until at $\sigma = 0.667$ it disappears completely, and when $\sigma$ is increased further, it
becomes negative and the GNGC increases again. (i)-(j) The recovered layer separation
using the algorithm described in Section 3.1.



3 The Layer Information Exchange 

Let $I_1$ and $I_2$ be two different combinations of two unknown layers $L_1$ and $L_2$,
obtained in an unknown fashion (i.e., the coefficients $\alpha_1(i)$, $\alpha_2(i)$, $\beta_1(i)$ and
$\beta_2(i)$ in Eq. (1) are unknown, spatially varying, non-linear mixing functions).
We will obtain a separation of $I_1$ and $I_2$ into two layers $\hat{L}_1$ and $\hat{L}_2$ (which
are visual representations of $L_1$ and $L_2$) by transferring information from $I_1$ to
$I_2$, and vice versa, until the structural correlation between those two images is
minimized. The information transfer is performed at different information scales,
ranging from the entire image to small image windows. To explain this concept
of “layer information exchange”, let us first examine the simpler case of uniform
mixing functions (i.e., constant unknown coefficients). We will later relax this
assumption, and show how the process is generalized to spatially-varying non-linear
mixing functions.



3.1 Handling Uniform Mixing Functions 

Assuming uniform mixing functions, Eq. (1) reduces to:

$$I_1(i) = \alpha_1 L_1(i) + \alpha_2 L_2(i), \qquad I_2(i) = \beta_1 L_1(i) + \beta_2 L_2(i) \tag{3}$$

There exists a constant scalar $\sigma$ such that $\hat{L}_1(i) = I_1(i) - \sigma I_2(i)$ will contain only
the geometric structure of $L_1(i)$, without any trace of $L_2(i)$. For example, $\sigma = \alpha_2 / \beta_2$
will lead to such a layer separation: $\hat{L}_1(i) = I_1(i) - \frac{\alpha_2}{\beta_2} I_2(i) = (\alpha_1 - \alpha_2 \frac{\beta_1}{\beta_2}) L_1(i)$.
Namely, $L_1(i)$ is recovered up to a constant scale factor $(\alpha_1 - \alpha_2 \frac{\beta_1}{\beta_2})$. However,
since $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ are not known, the transfer factor $\sigma$ is also unknown.
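As a quick numerical check of the algebra above, the following minimal sketch mixes two synthetic one-dimensional "layers" with constant coefficients and subtracts $\sigma I_2$ from $I_1$. The signals and the coefficient values are illustrative assumptions, not data from the paper; they are chosen so that $\alpha_2 / \beta_2 \approx 0.667$, the value quoted in the caption of Fig. 2.

```python
# Hypothetical layers and mixing coefficients (illustrative values only).
L1 = [1.0, 4.0, 2.0, 8.0, 5.0]
L2 = [3.0, 1.0, 7.0, 2.0, 6.0]
alpha1, alpha2 = 0.8, 0.4
beta1, beta2 = 0.3, 0.6

# Uniform mixing, as in Eq. (3).
I1 = [alpha1 * a + alpha2 * b for a, b in zip(L1, L2)]
I2 = [beta1 * a + beta2 * b for a, b in zip(L1, L2)]


def exchange(first, second, sigma):
    """Transfer sigma * second out of first: returns first - sigma * second."""
    return [p - sigma * q for p, q in zip(first, second)]


def residual_l2_weight(sigma):
    """Coefficient of L2 that remains in I1 - sigma * I2."""
    return alpha2 - sigma * beta2


sigma = alpha2 / beta2                    # transfer factor that cancels L2
L1_hat = exchange(I1, I2, sigma)          # recovered layer, up to scale
scale = alpha1 - alpha2 * beta1 / beta2   # predicted scale factor on L1

for recovered, original in zip(L1_hat, L1):
    print(round(recovered, 6), round(scale * original, 6))
```

For transfer factors below $\alpha_2 / \beta_2$ the residual weight of $L_2$ is positive, and above it the weight turns negative, which matches the behaviour described for the "fountain" layer in Fig. 2.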

We do know, however, that for the correct transfer factor $\sigma$, the layer $L_2(i)$
will disappear in $\hat{L}_1(i)$, thus minimizing the structural correlation between
$\hat{L}_1(i) = I_1(i) - \sigma I_2(i)$ and $I_2(i)$. This is visually shown in Fig. 2. We can
therefore recover the transfer factor $\sigma$ (and accordingly the layer $L_1$, up to a scale),
by minimizing the following objective function:

$$\sigma = \arg\min_{\sigma} \{ GNGC(I_2, I_1 - \sigma I_2) \} \tag{4}$$

Plugging in the definition of GNGC from Eq. (2) results in an objective function
which is quadratic in $\sigma$. Differentiating the above objective function with respect
to $\sigma$ and equating to zero (i.e., $\frac{\partial}{\partial \sigma} GNGC(I_2, I_1 - \sigma I_2) = 0$) yields an analytic
expression for $\sigma$:

[Eq. (5): closed-form expression for $\sigma$; not legible in the source]

where $C_i$ and $V_i$ are the local ($5 \times 5$) covariances and variances as defined in
Section 2. Having computed the transfer factor $\sigma$, we can recover the first layer
(up to a scale):

$$\hat{L}_1 = I_1 - \sigma I_2,$$

and proceed to computing the second layer in the same way. The second layer

$$\hat{L}_2 = I_2 - \eta \hat{L}_1,$$

is recovered by seeking $\eta$ which minimizes $GNGC(\hat{L}_1, I_2 - \eta \hat{L}_1)$. In practice, we
repeat this process a few times (typically 2 to 3 times), to obtain a cleaner layer
separation. At each iteration, the previously recovered $\hat{L}_1$ and $\hat{L}_2$ serve as the
new mixtures. Namely, $\hat{L}_1^{k+1} = \hat{L}_1^{k} - \sigma^{k+1} \hat{L}_2^{k}$ and $\hat{L}_2^{k+1} = \hat{L}_2^{k} - \eta^{k+1} \hat{L}_1^{k+1}$, where $k$
is the iteration number.

We refer to the above procedure as the “layer information exchange”, because
at each step we transfer some portion of one image to the other. For example,
the step $\hat{L}_1 = I_1 - \sigma I_2$ transfers some portion of $I_2$ to/from $I_1$ (depending on
whether $\sigma$ is negative/positive). In the next step, a different portion of the new
image $\hat{L}_1$ is transferred in the other direction, according to the magnitude and
sign of $\eta$. Figs. 2.i and 2.j show the two layers recovered from images $I_1$ and $I_2$
(Figs. 2.a and 2.b) by applying the above information exchange procedure.



3.2 Generalizing to Spatially Varying Mixing Functions 



So far we have assumed that the mixing coefficients $\alpha_1(i)$, $\alpha_2(i)$, $\beta_1(i)$ and $\beta_2(i)$
are constant. However, in most real-life scenarios, this is not true. To solve the
separation problem for the case of spatially-varying mixing functions, we assume
that if we use a small enough window $W_i$ around a pixel $i$, then within that region
of analysis the mixing coefficients are approximately uniform (although different
from the mixing coefficients in other nearby pixels). In other words, the global
layer exchange procedure described in Section 3.1 can be applied to a small local
region of analysis $W_i$ to compute $\sigma(i)$ and $\eta(i)$ at the corresponding pixel $i$. These
transfer factors are repeatedly computed for each pixel $i = 1..N$, using a window
$W_i$ centered around each image pixel. This results in a spatially-varying layer
information exchange: $\hat{L}_1(i) = I_1(i) - \sigma(i) I_2(i)$, and $\hat{L}_2(i) = I_2(i) - \eta(i) \hat{L}_1(i)$. This
procedure is repeated iteratively: $\hat{L}_1^{k+1}(i) = \hat{L}_1^{k}(i) - \sigma^{k+1}(i) \hat{L}_2^{k}(i)$,
$\hat{L}_2^{k+1}(i) = \hat{L}_2^{k}(i) - \eta^{k+1}(i) \hat{L}_1^{k+1}(i)$, until $\sigma^{k}(i)$ and $\eta^{k}(i)$ are small enough (where $k$ is the
iteration number).

Note that we are now dealing with two different types of local image windows:
(i) the local region of analysis $W_i$ used for the piecewise approximation of the
mixing functions, and (ii) the small $5 \times 5$ window (mentioned in Section 2),
which is used for obtaining local measurements (local NGC) to be summed for
generating the global GNGC measure. These $5 \times 5$ windows are the smallest
reliable information elements over which the local NGC measures are computed
across the two images (regardless of whether the mixing functions are uniform
or not). These local measures are then summed within the region of analysis,
which is the entire image for the case of uniform mixing functions, and a smaller
$W_i$ in the case of spatially varying mixing functions.




Fig. 3. Handling spatially varying mixing functions. (a)-(b) The two mixtures
$I_1$ and $I_2$ were obtained by mixing two images (“fountain” and “waterfall”) with 4
different non-linear functions ($\alpha_1$ was a sinusoid, $\alpha_2$ and $\beta_1$ were two exponential
functions, and $\beta_2$ was a constant function). (c)-(d) The recovered transparent layers
using our global-to-local layer separation method described in Section 3.2.

Since we do not know ahead of time the degree of non-linearity of the mixing
functions, the above local procedure is repeated using coarse-to-fine (i.e.,
large-to-small) regions of analysis $W_i$. We start the iterative process with $W_i$
being the entire image. This accounts for the case of globally uniform mixing
functions (i.e., constant coefficients throughout the entire image). We then
gradually decrease the window size $W_i$ to smaller and smaller windows (but not
below $15 \times 15$, for numerical stability). This gradual process aims to ensure
that the resulting mixing functions remain as smooth as possible, whenever a
smooth solution is a valid interpretation. Deviations from uniform/constant
mixing functions occur only when there is no simpler interpretation.
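The coarse-to-fine schedule can be sketched as follows. The text specifies only the starting point (the entire image) and the 15 x 15 lower bound; the halving rule used here to shrink the window is an assumption made for illustration, and the function name is hypothetical.

```python
def window_schedule(height, width, min_size=15):
    """Return square window side lengths ordered from coarse to fine.

    Starts with a window large enough to cover the whole image and halves
    the side length (an assumed shrink rule) until the lower bound is hit.
    """
    sizes = []
    size = max(height, width)      # a window this large covers the image
    while size > min_size:
        sizes.append(size)
        size //= 2
    sizes.append(min_size)         # never go below the stability floor
    return sizes


schedule = window_schedule(240, 320)
print(schedule)
```

At each size in the schedule, the local exchange of this section would be run with that region of analysis before moving on to the next, smaller window.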

Fig. 3 shows an example of applying the above procedure to the pair of
mixtures $I_1$ and $I_2$ (Figs. 3.a and 3.b). These images were generated with
spatially varying non-linear mixing functions (see figure for more details).
Figs. 3.c and 3.d show the resulting separation obtained using our layer
information exchange (without prior knowledge of the spatially varying mixing
coefficients, of course). The procedure completely separates the structures of
the two layers.

4 Applications 

The information exchange approach assumes that two different initial combinations
($I_1$ and $I_2$) of the unknown transparent layers ($L_1$ and $L_2$) are available,
but the way in which $I_1$ and $I_2$ were generated from $L_1$ and $L_2$ is not known,
and can be very complex. In this section we explore some cases where such
initial combinations are readily available or else easy to extract, and show the
applicability of our layer exchange approach for addressing these cases.




Fig. 4. Recovering Transparent Layers from Polarized Images. (a)-(b) Two
real images obtained under different polarizations, showing the reflection of Sheila in
a Renoir picture. (The images were taken from Farid [5].) (c)-(d) The recovered
transparent layers using our layer separation method.



4.1 Separating Layers in Polarized Images 

Due to the physical nature of light polarization through reflecting and transmitting
surfaces, two superimposed transparent layers differ in their polarization.
Different mixtures ($I_1$ and $I_2$) of transparent scene layers can be obtained by
changing the angle of a polarization filter in front of the camera (as in [5,8]).
Fig. 4 shows the result of applying our algorithm to a real pair of images of the
same scene obtained with different polarizers. (These results are comparable to
those of [5].)

Fig. 5. Separating non-rigid transparencies in video. Column (a) Five frames
from the input movie (see text for details). Columns (b)-(c) The initial separation
acquired by extracting the median image from the aligned sequence. Columns (d)-(e)
The recovered layers. The residual traces of the woman which were visible in (b) are
removed in (d), the true color of the fountain is recovered, and the temporal variations
in the indoor illumination are recovered in (e). The video sequences can be viewed at
http://www.wisdom.weizmann.ac.il/~vision/TrasnparentLayers.html.

4.2 Separating Non-Rigid Transparent Layers in Video 

When a video camera records two transparent layers with different relative mo- 
tions over time, and when the motion of one of those layers is easy to compute 
(e.g., if it is a 2D parametric motion), then such a layer separation is possible. 
This can be done even if the second layer contains very complex non-rigid mo- 
tions (such as flickering fire, running water, walking people, etc.), the mixing 
process is not known and may be spatially varying (e.g., due to varying glass 
opacity or window dirt), and other temporal changes may occur simultaneously 
(such as varying illumination over time). 

Such examples are shown in Fig. 5 (a simulated example) and in Fig. 7 (a real
example). Fig. 5 shows a simulated example of an indoor scene with motion and
varying illumination, reflected in a window through which a dynamic outdoor
scene is visible. The input video sequence was generated by superimposing two
video sequences: (i) an “indoor scene” video, showing a woman’s head moving
while the illumination changes over time (dimming and brightening of indoor
illumination), reflected in a window, and (ii) an outdoor scene of a fountain
displaying highly non-rigid complex motion, with changing specular reflections,
etc. The left column of Fig. 5 displays some representative frames from the
generated sequence. The woman’s reflection is more visible when the illumination
is darker, and is less visible when the illumination is brighter. The goal here was
to separate this generated sequence into its original two layers (sequences, in this
case): the outdoor scene (the fountain) with all its dynamics and specularities,
and the indoor scene (the woman) with its motion and changing illumination.

In this case we have only one input (the video sequence of Fig. 5.a). To obtain
two different initial layer mixtures ($I_1$ and $I_2$), we did the following: The woman’s
motion is a simple 2D parametric motion, which can be computed using one of
the dominant motion estimation methods (e.g., [1,2,9]). This brings the woman
into alignment. We then apply Weiss’ method for extracting intrinsic images [11]
to the aligned sequence. This process recovers a median image of the
woman, and a residual image for each frame after removing the median image
of the woman. These are displayed in the second and third columns of Fig. 5
(after unwarping the images to their original coordinate system according to the
estimated 2D motion of the woman). Because the process of [11] results in a
single intrinsic image, it does not capture any temporal changes. As a result,
the woman’s sequence in Fig. 5.c does not contain any of the changes in indoor
illumination, and the “residual” sequence (Fig. 5.b) still contains a small residue
of the woman (sometimes dark, sometimes bright), while the true colors of the
fountain are lost.
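A much simplified sketch of the "median plus residual" initialization is given below. It takes a per-pixel temporal median over an already aligned sequence and subtracts it from every frame. This is only a stand-in under stated assumptions: Weiss' method [11] works on filtered (derivative) images rather than on raw intensities, and the tiny frames used here are made-up values.

```python
from statistics import median


def median_and_residuals(frames):
    """Split an aligned sequence into a median image and per-frame residuals.

    frames: list of frames, each a list of rows of pixel values.
    Returns (median_image, residual_frames); each residual frame is the
    input frame minus the per-pixel temporal median.
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    med = [[median(f[r][c] for f in frames) for c in range(cols)]
           for r in range(rows)]
    residuals = [[[f[r][c] - med[r][c] for c in range(cols)]
                  for r in range(rows)] for f in frames]
    return med, residuals


# Three hypothetical 2 x 2 frames: a static layer plus a changing one.
frames = [
    [[10, 20], [30, 40]],
    [[12, 20], [30, 47]],
    [[10, 25], [31, 40]],
]
median_image, residual_frames = median_and_residuals(frames)
print(median_image)
print(residual_frames[1])
```

Because one median image is shared by all frames, it cannot follow temporal changes; anything that varies over time ends up in the residuals, which is why the two outputs are still mixtures rather than clean layers.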

Each pair of images in the second and third columns of Fig. 5 can be regarded
as initial layer mixtures $I_1$ and $I_2$ (unknown and non-linear) for that
time instance. These sequences ($I_1$ and $I_2$) are fed as the initial combinations
to our layer exchange process. Results of the layer separation process are
displayed in the last two columns of Fig. 5. Note that now the fountain sequence is
fully recovered, with its true colors and no traces of the woman (Fig. 5.d), while
the true changes in indoor illumination have been recovered and automatically
associated with the indoor woman sequence (Fig. 5.e).

The initial separation into “medians” and “residuals” forms the initial
mixtures $I_1$ and $I_2$ above. The (unknown) mixing functions which relate $I_1$
and $I_2$ to the original (unknown) layers (see Eq. (1)) cannot be assumed to be
constant or position invariant. This is because the median operator is non-linear.
Our Information Exchange approach handles this well (see Figs. 5.d and 5.e).
However, the ICA-based separation [5,13] does not perform well on these $I_1$ and
$I_2$, as can be seen in Figs. 6.c and 6.f. This is because it is not suited for the
case of non-uniform spatially varying coefficients.

To the best of our knowledge, this is the first time videos containing non-rigid
transparent motions have been separated (and moreover, under unknown varying
lighting conditions). Current approaches for video transparency separation (e.g.,
[9,10,12]) assume that each layer moves rigidly, since dense correspondences
of both layers across the sequence need to be recovered in those methods. We
currently need to compute only one of the motions, allowing the second motion
to be arbitrarily complex.

Fig. 6. ICA vs. Information Exchange separation. Columns, left to right: initial
mixtures (a),(d); layer-exchange separation (b),(e); ICA separation (c),(f). We compare
results of applying the ICA-based separation [5,13] to our layer-based separation displayed in
Fig. 5. ICA was applied to the same initial mixture sequences (the “median” and the
“residual” images in Figs. 5.b and 5.c). Almost all of the resulting frames displayed
wrong separation. One such example is shown in the third column of this figure ((c)
and (f)). For comparison, we display the corresponding frames of the initial mixture
images (a) and (d), and our separation result (b) and (e).

Fig. 7 shows a real example of video transparency with non-rigid motions
and changing effects of illumination. In this case, a still video camera recorded
a scene with non-rigid human motions reflected in a swivelling glass door of an
entry hall to a building. The reflected outdoor scene therefore appears moving,
while the indoor scene is static. In the last part of the sequence, due to strong
reflections of light in the glass, the AGC (Automatic Gain Control) of the camera
induced fluctuating changes in the dynamic range of the image. The left column
of Fig. 7 displays three representative frames from the recorded sequence. As
before, we used Weiss’ method [11] for extracting the intrinsic image from the
sequence. The median image was then removed from the sequence, producing a
“residual” sequence. These were used as the initial combinations ($I_1$ and $I_2$) for
our layer exchange approach. The resulting separation into layers is displayed
in the second and third columns of Fig. 7. The reflected scene was separated
from the glass door, and the changing effects of illumination due to the change
in aperture have also been recovered.

5 Conclusions 

We presented an approach for separating two transparent layers through a process
termed the “layer information exchange”. Given two different (unknown,
complex) combinations of the layers, we recover the layers by gradually transferring
information from one image to the other, until the structural correlation
across the two images is minimized. The information transfer is done at different
information scales, ranging from the entire image to small image windows.

Fig. 7. Separating non-rigid transparencies in video. (a) Input sequence: three
frames from a real video sequence of the entrance hall of a building recorded through
the building’s swivelling glass door. The outdoor scene (including a running man and the
camera tripod) is reflected from the swivelling door. The indoor scene includes
a statue and a plant. (b) Recovered layer 1: the outside scene. (c) Recovered layer 2:
the interior hall with the statue. The video sequences can be viewed at
http://www.wisdom.weizmann.ac.il/~vision/TrasnparentLayers.html.



We showed the applicability of this approach to various real-world scenarios,
including image and video transparency separation. To the best of our knowledge, this
is the first time that complex non-rigid transparent motions in video have been
separated, without any prior knowledge of the layer mixing model, and under
unknown complex temporal changes. We further showed that our approach to
layer separation performs as well as ICA (Independent Component Analysis)
when the mixing functions are spatially fixed (i.e., independent of the pixel
position). However, when the mixing functions are more realistic, spatially varying
functions (i.e., they vary as a function of pixel position), our approach performs
better than ICA. We believe that the applicability of this approach goes beyond
analysis and separation of image layers, and can possibly be applied to
separating other types of signals (such as acoustic signals, radar signals, etc.).

Abstract. We propose a multiple classifier system approach to object recognition
in computer vision. The aim of the approach is to use multiple experts successively 
to prune the list of candidate hypotheses that have to be considered for object inter- 
pretation. The experts are organised in a serial architecture, with the later stages 
of the system dealing with a monotonically decreasing number of models. We 
develop a theoretical model which underpins this approach to object recognition 
and show how it relates to various heuristic design strategies advocated in the 
literature. The merits of the advocated approach are then demonstrated experi- 
mentally using the SOIL database. We show how the overall performance of a 
two stage object recognition system, designed using the proposed methodology, 
improves. The improvement is achieved in spite of using a weak recogniser for the 
first (pruning) stage. The effects of different pruning strategies are demonstrated. 



1 Introduction 

There are several papers [4,14,15,16,17,8,11,9] concerned with multiple classifier system
architectures suggesting that complex architectures, in which the decision process is
decomposed into several stages involving coarse to fine classification, result in improved
recognition performance. In particular, by grouping classes and performing initially
coarse classification, followed by a fine classification refinement which disambiguates
the classes of the winning coarse group, one can achieve significant gains in
performance. [9] applies this approach to the problem of handwritten character recognition
and suggests that class grouping should maximise an entropy measure. Similar strategies
have been advocated in [4,14,15,16,17]. The popular decision tree methods can be seen
to exploit the same phenomenon.

The aim of this paper is to demonstrate that these heuristic processes do have a 
theoretical foundation. We propose a framework for analysing the benefit of hierarchical 
class grouping. Using this framework we develop a theoretical basis for multiple expert 
fusion in serial coarse to fine object recognition system architectures. The analysis will 
suggest and explain a number of strategies that can be adopted to build such architectures. 

We apply the proposed design methodology to the problem of 3D object recognition
using 2D views. This problem has been receiving a lot of attention over the last two
decades, resulting in a spectrum of techniques which exploit, for instance, colour
[20,5,12], shape [2,19] and object appearance [18,10,13]. Although none of the existing
methods provides a panacea on its own, we argue that a combination of several object
recognition techniques can be very effective.

More specifically we demonstrate that the proposed approach accomplishes a
sequential pruning of the list of object model hypotheses, with the later stages of the
system having to deal with a monotonically decreasing number of models. The merits of
the advocated approach are then demonstrated experimentally using the SOIL database.
We show how the overall performance of a two stage object recognition system based
on the expounded principles improves. The improvement is achieved in spite of using a
weak recogniser for the first (pruning) stage. The effects of different pruning strategies
are demonstrated.

The paper is organised as follows. In Section 2 the problem of object recognition using 
hierarchical class grouping is formulated. We derive an expression for the additional 
decision error, over and above the Bayes error, as a function of estimation error. In 
Section 3 we discuss various model pruning strategies that naturally stem from this 
analysis. In Section 4 one of these strategies is applied to the problem of 3D object 
recognition using a two stage decision making system. Section 5 draws the paper to a
conclusion.

2 Mathematical Notation and Problem Formulation 

Consider an object recognition problem where object $Z$ is to be assigned to one of
$m$ possible models $\{\omega_i, i = 1, \ldots, m\}$. Let us assume that the given scene object is
represented by a measurement vector, $\mathbf{x}$. In the measurement space each object category
$\omega_k$ is modelled by the probability density function $p(\mathbf{x}|\omega_k)$ and let the a priori probability
of object occurrence be denoted by $P(\omega_k)$. We shall consider the models to be mutually
exclusive, which means that only one model can be associated with each instance.

Now according to the Bayesian decision theory, given measurements $\mathbf{x}$, the instance,
$Z$, should be assigned to model class $\omega_j$, i.e. its label $\theta$ should assume value $\theta = \omega_j$,
provided the aposteriori probability of that interpretation is maximum, i.e.

assign $\theta \rightarrow \omega_j$ if

$$P(\theta = \omega_j | \mathbf{x}) = \max_{k} P(\theta = \omega_k | \mathbf{x}) \tag{1}$$

In practice, for each interpretation, a decision making system will provide only an
estimate $\hat{P}(\omega_i|\mathbf{x})$ of the true aposteriori class probability $P(\omega_i|\mathbf{x})$ given measurement
$\mathbf{x}$, rather than the true probability itself. Let us denote the error on the estimate of the $i$-th
model class aposteriori probability at point $\mathbf{x}$ as $e(\omega_i|\mathbf{x})$ and let the probability
distribution of errors be $p_i[e(\omega_i|\mathbf{x})]$. Clearly, due to estimation errors, the object recognition
based on the estimated aposteriori probabilities will not necessarily be Bayes optimal. In
the appendix we derive the probability, $e_S(\mathbf{x})$, of the decision relating to object $\mathbf{x}$ being
suboptimal, and refer to it as the switching error probability. We show in (14) that
this probability primarily depends on the margin $\Delta P_{si}(\mathbf{x}) = P(\omega_s|\mathbf{x}) - P(\omega_i|\mathbf{x})$
between the aposteriori probabilities of the Bayes optimal hypothesis $\omega_s$ and the next most
probable model $\omega_i$, as well as on the width (variance) of the distribution of estimation
error.

Now how do these labelling errors translate to recognition error probabilities? We
know that for the Bayes minimum error decision rule the error probability at point $\mathbf{x}$
will be $e_B(\mathbf{x})$. If our pseudo Bayesian decision rule, i.e. the rule that assigns patterns
according to the maximum estimated aposteriori class probability, deviates from the
Bayesian rule with probability $e_S(\mathbf{x})$, the local error of the decision rule will be given by

$$\alpha(\mathbf{x}) = e_B(\mathbf{x})[1 - e_S(\mathbf{x})] + e_S(\mathbf{x})[1 - e_B(\mathbf{x})] \tag{2}$$

The error, $\alpha(\mathbf{x})$, will be close to Bayesian only if $e_S(\mathbf{x})$ is negligible. Thus we want
the label switching error to be as small as possible.
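To make the dependence on the switching error concrete, the sketch below evaluates the local error for a fixed Bayes error and a few switching error probabilities. It reads Eq. (2) as combining the Bayes error $e_B(\mathbf{x})$ with the switching probability $e_S(\mathbf{x})$; the numerical values are hypothetical.

```python
def local_error(e_bayes, e_switch):
    """Local error of the pseudo Bayesian rule, following Eq. (2).

    The two terms are: the Bayes decision is wrong and is kept, or the
    Bayes decision is right and the label is switched.
    """
    return e_bayes * (1.0 - e_switch) + e_switch * (1.0 - e_bayes)


# Hypothetical values: 10% Bayes error, increasing switching error.
for e_s in (0.0, 0.05, 0.2):
    print(e_s, round(local_error(0.10, e_s), 4))
```

With no switching the local error equals the Bayes error, and it grows as the switching probability increases, which is why the remainder of the analysis concentrates on reducing $e_S(\mathbf{x})$.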

Conventionally, the multiple classifier fusion paradigm attempts to ameliorate the
switching error probability by reducing the variance of estimation errors. This is achieved
by combining multiple estimates obtained by a number of diverse object recognition
experts. In this paper we adopt a completely different approach that strives to increase
the margin between the posteriors of the competing model hypotheses in order to reduce
the error probability $e_S(\mathbf{x})$ by alternative means. The basic idea is to group models into
superclasses in such a way that the margin between the posteriors of the resulting model
sets widens. The number of groups is a free parameter. For our purposes we divide the
classes into two groups and perform a coarse classification of the input pattern to one
of these two groups. Then, in the next stage, we refine the classification and continue
dividing the most probable superclass into two subsets by considering the remaining
alternatives.

In general, there will be $m$ hypotheses that can be grouped hierarchically into two
groups at each stage of the hierarchy. Let us denote the two groups created at stage $k$
by $\Omega^k$ and $\bar{\Omega}^k$. The set $\Omega^k$ will be divided in the next stage into two subsets, and so on.
Thus the class sets $\Omega^k$ will satisfy

$$\Omega^k \subset \Omega^j, \quad j < k \tag{3}$$

Further, let us denote the probability of classifying measurement vector $\mathbf{x}$ from superclass
$\Omega^k$ suboptimally by $w^k(\mathbf{x})$. Referring to (14), in this two (super)class case, the switching
error probability at stage $k$ is given simply by

$$w^k(\mathbf{x}) = \int_{\Delta P_{\Omega^k}(\mathbf{x})}^{\infty} p[\eta_{\Omega^k}(\mathbf{x})] \, d\eta_{\Omega^k}(\mathbf{x}) \tag{4}$$

where $\Delta P_{\Omega^k}(\mathbf{x})$ is the margin between the posteriors of the two superclasses at stage $k$
and $\eta_{\Omega^k}(\mathbf{x})$ is the associated estimation error. Assuming that the Bayes optimal
hypothesis is contained in set $\Omega^k$, it will end up in superclass $\Omega^{k+1}$ with probability $1 - w^k(\mathbf{x})$.
Similarly, at the $(k+1)$-st stage the probability of making a suboptimal decision is
$w^{k+1}(\mathbf{x})$, while the Bayes optimal decision will be made with probability $1 - w^{k+1}(\mathbf{x})$.
The complete $n$-stage hypothesis refinement process is illustrated in Figure 1. By
reference to Figure 1 the total switching error probability of the hierarchical decision making
process can be written as

$$e_S(\mathbf{x}) = w^1(\mathbf{x}) + \sum_{i=2}^{n-1} \Big[ \prod_{j=1}^{i-1} \big(1 - w^j(\mathbf{x})\big) \Big] w^i(\mathbf{x}) + \Big[ \prod_{j=1}^{n-1} \big(1 - w^j(\mathbf{x})\big) \Big] w^n(\mathbf{x}) \tag{5}$$

Fig. 1. The total probability of label switching, $e_S(\mathbf{x})$, in a coarse to fine multistage
object recognition system
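The accumulation of switching errors over the stages in expression (5) can be sketched in a few lines. The stage error values below are hypothetical; the function simply sums, for each stage, the probability of reaching that stage without a switch times the probability of switching there.

```python
def total_switching_error(stage_errors):
    """Total label switching probability over n serial stages.

    stage_errors[k] is the switching probability at stage k + 1, given that
    all earlier stages kept the Bayes optimal hypothesis.
    """
    total = 0.0
    survive = 1.0                  # product of (1 - w) over earlier stages
    for w in stage_errors:
        total += survive * w
        survive *= 1.0 - w
    return total


# Hypothetical three-stage system with small errors in the early stages.
w = [0.01, 0.02, 0.10]
e_s = total_switching_error(w)
print(round(e_s, 5))
```

The sum telescopes to one minus the product of the per-stage survival probabilities, so when the early stage errors are negligible the total is dominated by the last stage.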



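Note that the final stage will involve only the closest competitors, classes $\Omega^n =
\{\omega_s, \omega_i\}$. The probability $w^n(\mathbf{x})$ of label switching will be given by

$$w^n(\mathbf{x}) = \int_{\Delta Q_{si}(\mathbf{x})}^{\infty} p[\eta_{\omega}(\mathbf{x})] \, d\eta_{\omega}(\mathbf{x}) \tag{6}$$

where $\eta_{\omega}(\mathbf{x}) = 2e(\omega_i|\mathbf{x})$ is the combined estimation error for the two posteriors, since
in the two class case $e(\omega_s|\mathbf{x}) = -e(\omega_i|\mathbf{x})$.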

In equation (6) we denote the aposteriori probabilities, $Q(\omega_r|\mathbf{x})$, $r = s, i$, for model
classes $\omega_s$ and $\omega_i$ by different symbols to indicate that these functions differ from
$P(\omega_r|\mathbf{x})$, $r = s, i$, by a scaling factor $P(\omega_s|\mathbf{x}) + P(\omega_i|\mathbf{x})$ since they have to sum
up to one. Note that if functions $Q(\omega_r|\mathbf{x})$ are estimated via probability densities, the
estimation errors will be scaled up versions of the original errors $e(\omega_r|\mathbf{x})$. However, if
these functions are estimated directly from the training data, the errors will be different
and can be assumed to have the same distribution as the original unscaled errors $e(\omega_r|\mathbf{x})$.
If this is the case, then one can see why this two stage approach may produce better
results. The probability mass under the tail of the error estimation distribution will rapidly
decay as the margin (the tail cut off) increases. If the error distributions are the same but
the margins increase by scaling, the probability of label switching will go down.

3 Discussion 

Let us consider the implication of expression (5). Assuming that the estimation errors
have identical distribution at all the stages of the sequential decision making process,
the label switching error $w^i(\mathbf{x})$ at stage $i$ will be determined entirely by the margin
(difference) between the aposteriori probabilities of classes $P(\Omega^i|\mathbf{x})$ and $P(\bar{\Omega}^i|\mathbf{x})$.
By grouping model classes at the top of the hierarchy we can increase this margin and
therefore control the additional error. In this way we can ensure that the additional errors
$w^i(\mathbf{x})$ in all but the last stage of the decision making process are negligible. In the
limiting case, when $w^i(\mathbf{x}) \rightarrow 0$, $i = 1, \ldots, n-1$, the switching error $e_S(\mathbf{x})$ will be
equal to $w^n(\mathbf{x})$. At that point the set $\Omega^n$ is likely to contain just a single class. Thus the
last stage decision will involve two classes only. Note that whereas the margin between
the aposteriori probabilities of the two model classes, say $\omega_s$ and $\omega_i$, at the top of the
hierarchy, was $P(\omega_s|\mathbf{x}) - P(\omega_i|\mathbf{x})$, in the last stage, it will become

$$\Delta Q_{si}(\mathbf{x}) = \frac{P(\omega_s|\mathbf{x}) - P(\omega_i|\mathbf{x})}{P(\omega_s|\mathbf{x}) + P(\omega_i|\mathbf{x})} \tag{7}$$

Thus the margin will be significantly magnified and consequently the additional error
$e_S(\mathbf{x})$ significantly lower than what it would have been in a single stage system.
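The effect of the magnified margin can be illustrated numerically. The sketch below compares the single stage margin with the renormalised last stage margin for two hypothetical posteriors, and evaluates the tail probability beyond each margin. The zero-mean Gaussian error model and its spread are assumptions made purely for illustration; the argument in the text only requires that the error distribution be the same in both cases.

```python
import math


def renormalised_margin(p_s, p_i):
    """Margin between two classes after rescaling their posteriors to sum to one."""
    return (p_s - p_i) / (p_s + p_i)


def gaussian_tail(margin, std):
    """P(error > margin) for a zero-mean Gaussian error (an assumed model)."""
    return 0.5 * math.erfc(margin / (std * math.sqrt(2.0)))


# Hypothetical posteriors of the two closest competitors among many classes.
p_s, p_i = 0.30, 0.20
single_stage = p_s - p_i                      # margin in a one-shot decision
last_stage = renormalised_margin(p_s, p_i)    # margin in the final stage

std = 0.1                                     # assumed error spread
print(round(single_stage, 3), round(gaussian_tail(single_stage, std), 4))
print(round(last_stage, 3), round(gaussian_tail(last_stage, std), 4))
```

In this example the margin doubles, and under the assumed error model the tail mass beyond it drops by roughly a factor of seven.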

The expression (5) immediately suggests a number of grouping strategies. For
instance, in order to maintain the margin as large as possible in all stages of the decision
making process it would clearly be most effective to group all but one class in one
superclass and the weakest class in the complement superclass. This strategy has been
suggested, based on heuristic arguments, in [21]. The disadvantage of this strategy is that it
would involve $m - 1$ decision steps.

Computationally more effective is to arrive at a decision after $\log_2 m$ steps. This
would lead to grouping which maintains a balance of the two class sets $\Omega^k$ and $\bar{\Omega}^k$.
Another suggestion [9] is to split the classes so as to minimise an entropy criterion.
However, all these strategies exploit the same underlying principle embodied by our
model.
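The difference in the number of decision steps between the two grouping strategies is easy to tabulate. The helper names below are hypothetical, and the balanced count is rounded up to a whole number of stages when $m$ is not a power of two.

```python
def steps_one_vs_rest(m):
    """Removing one class per stage needs m - 1 binary decisions."""
    return m - 1


def steps_balanced(m):
    """Halving the candidate set per stage needs ceil(log2 m) decisions."""
    return (m - 1).bit_length()


for m in (2, 8, 20, 100):
    print(m, steps_one_vs_rest(m), steps_balanced(m))
```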

4 Experimental Results 

In this section we illustrate the merits of model grouping within the context of 3D object 
recognition. The core of our object recognition system is a region-based matching scheme
proposed in [1]. In this method an object image is represented by its constituent regions 
segmented from the image. The regions are represented in the form of an Attributed 
Relational Graph (ARG). In this representation each region is described individually 
and by its relation with its neighbouring regions expressed in terms of binary measure- 
ments. We use the representative colour of each region as its unary measurement and we 
characterize geometric relations between region pairs using binary measurements. 

The matching process is performed using probabilistic relaxation labelling [3]. In this 
approach, for each region from a test image, we compute the probability that the region 
corresponds to a particular node of the ARG representing the combined set of object 
models. We model an object using an image taken from the frontal view. The label 
probabilities for a region in the test image are initialized by measuring the similarity be- 
tween the unary measurements corresponding to the two regions being matched. These 
probabilities are then updated by taking into account the consistency of labelling at the 
neighbouring regions. 

We tested the idea of grouping from two different aspects: label and model grouping.
By label grouping we mean that for each region in the test image we classify the union
of labels associated with the regions in the database into two sets: candidate labels and
rejected labels. The process of region matching then proceeds using only the set of
candidate labels. We propose two label pruning schemes: pruning at the initialization
stage and at the end of each iteration of the relaxation labelling. At initialization, for
each region in the test image we compile a list of candidate labels. This list is based
on the degree of similarity between unary measurements. At the end of each iteration
of the relaxation labelling process we note the label probabilities associated with each
region in the test image and drop those labels whose probabilities are below a predefined
threshold.
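The per-iteration pruning step can be sketched as a simple filter over the label probabilities of one region. The label names, probabilities and threshold below are hypothetical, and renormalising the surviving probabilities is an assumption that the text does not state.

```python
def prune_labels(label_probs, threshold):
    """Drop labels whose probability is below the threshold.

    The survivors are renormalised to sum to one (an assumed convention).
    Returns an empty dict if every label is rejected.
    """
    kept = {lab: p for lab, p in label_probs.items() if p >= threshold}
    total = sum(kept.values())
    if total == 0.0:
        return {}
    return {lab: p / total for lab, p in kept.items()}


# Hypothetical label probabilities for one test-image region.
region = {"cup:handle": 0.55, "box:lid": 0.30, "can:top": 0.10, "jar:cap": 0.05}
candidates = prune_labels(region, threshold=0.08)
print(sorted(candidates))
```

The relaxation labelling update would then run on the reduced candidate set only, which is where the computational saving comes from.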

Model grouping realizes the same idea at a higher level. Let us consider our model- 
based recognition system as a set of serial classifiers where each classifier is in fact 
an object recognition expert. Each expert takes the list of model candidates from the 
previous expert and delivers a pruned list of model candidates to the next classifier. The 
pruning of models is performed by matching the test image features against features 
extracted from the model candidates. The objective for the last expert is to select the 
winning candidate. 

In a simple case of this scenario we consider just two recognition experts in tandem. 
The first expert performs a coarse grouping of the object hypotheses based on an entropy 
criterion. This initial classification is performed using colour cues. We opt for the MNS 
method of Matas et al. [12] for this purpose. In this method the colour structure of 
an image is captured in terms of a set of colour descriptors computed on multimodal 
neighbourhoods detected in the image [12]. We use the similarity between the descriptors 
from the scene image and each of the m object models to find the a posteriori probabilities 
of the object in the scene image belonging to the various model classes in the database. 

Having obtained the set of a posteriori probabilities $\mathcal{P} = \{P(\omega_i|\mathbf{x}), \forall i \in \{1 \cdots m\}\}$, 
we rank them in descending order. Our objective is to compile a list of hypothesised 
objects based on their likelihood of being in the scene ($\mathcal{P}$). For this purpose we use the 
entropy of the system as a criterion. Let us consider the list, $\Omega$, of model hypotheses 
arranged in descending order of their probabilities. If $\Omega$ is split into two 
groups $\Omega^1$ and $\Omega^2$ comprising the $K$ most likely objects in the scene and the remaining 
objects in the database respectively, the entropy of the system is evaluated as follows [7]: 

$E = \alpha E(\Omega^1) + (1 - \alpha) E(\Omega^2)$ (8) 

where $E(\Omega^1)$ and $E(\Omega^2)$ are the entropies associated with groups $\Omega^1$ and $\Omega^2$ respectively 
and $\alpha$ is the probability that the object present in the scene belongs to the group $\Omega^1$. 
By searching the range of possible configurations ($K = 1 \cdots m$), the grouping with the 
minimum entropy is selected and the group of the hypothesised objects, $\Omega^1$, is passed to 
the next expert. The second expert is the ARG matcher [1] described earlier. The whole 
recognition system is referred to as the MNS-ARG method. 
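The search over group sizes can be sketched as below. The text does not spell out how the group entropies are computed, so this sketch makes explicit assumptions: each group entropy is the Shannon entropy of the posteriors renormalised within the group, and the weight of the first group is the sum of its posteriors. The names `group_entropy` and `select_hypotheses` are hypothetical.

```python
import math


def group_entropy(probs):
    """Shannon entropy of `probs` after renormalising them (assumption)."""
    total = sum(probs)
    if total <= 0.0:
        return 0.0
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0.0)


def select_hypotheses(posteriors):
    """Return indices of the models kept by the minimum-entropy split.

    posteriors: list of a posteriori probabilities, one per model.
    """
    # Rank model indices in descending order of probability.
    ranked = sorted(range(len(posteriors)),
                    key=lambda i: posteriors[i], reverse=True)
    sorted_p = [posteriors[i] for i in ranked]
    best_k, best_entropy = 1, float("inf")
    for k in range(1, len(sorted_p) + 1):
        # Assumption: weight of the first group = its total posterior mass.
        alpha = sum(sorted_p[:k])
        entropy = (alpha * group_entropy(sorted_p[:k])
                   + (1.0 - alpha) * group_entropy(sorted_p[k:]))
        if entropy < best_entropy:
            best_k, best_entropy = k, entropy
    return ranked[:best_k]
```

With one dominant posterior the split keeps a single model; with several similar leading posteriors it keeps all of them, which is the behaviour a pruning stage needs.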

We designed two experiments to demonstrate the effect of both label pruning and 
model pruning on the performance of the ARG method. We compared three recognition 
systems from the recognition rate point of view: ARG with/without label pruning and 
MNS-ARG (with label pruning). The experiments were conducted on the SOIL-47 (Sur- 
rey Object Image Library) database which contains 47 objects each of which has been 
imaged from 21 viewing angles spanning a range of up to ±90 degrees. Figure 2 shows 
the frontal view of the objects in the database. The database is available online [6]. In 
Fig. 2. The frontal view of some objects in the SOIL-47 database 




Fig. 3. An object in the database imaged from 20 viewing angles 



this experiment we model each object using its frontal image while the other 20 views 
of the objects are used as test images (Fig. 3). The size of images used in this experiment 
is 288 × 360 pixels. 

In the first experiment we applied the ARG matching for two different cases: with 
label pruning and without label pruning. The recognition performance for these two cases 
is shown in Fig. 4. As can be seen, the performance of the ARG matching is considerably 
enhanced by label pruning. It is worth noting that, as this experiment showed, label 
pruning also significantly speeds up the relaxation labelling process. 









Multiple Classifier System Approach to Model Pruning in Object Recognition 







Fig. 4. The percentage of correct recognition for the ARG and the MNS-ARG methods 





Fig. 5. The likelihood of the correct model being in the list of hypothesised objects generated by 
the MNS method 



In the second experiment, for each test image we applied the MNS method to deter- 
mine the hypothesised objects matched to it. The results of the experiment are shown in 
Fig. 5. In this figure we plot the percentage of cases in which the list of hypothesised 
objects includes the correct model. This rate has been shown as a function of the object 
pose. For comparison we plot the percentage of cases in which the correct object has the 
highest probability among the other candidates. It is referred to as the recognition rate. 
The results illustrate that the recognition rate for the MNS method is not very high. This 
is not surprising as many grocery items have similar surface colours. 
In contrast, as seen from Fig. 5, in the majority of cases the hypothesised list includes 
the correct object. It is worth noting that the average size of the list of hypothesised 
objects is 16, which is about one third of the database size (47 objects). 

The ARG method was then applied to identify the object model based on the list of 
hypothesised objects generated by the MNS method. This recognition procedure was 
applied to all test images in the database. In Fig. 4 we have plotted the recognition rate 
for the MNS-ARG method as a function of object pose. For comparison we have shown 
the recognition rate when the ARG method is applied as a stand-alone expert. As a baseline 
we added the rate of correct classification of the MNS method. The results show that 
the object grouping using the MNS method improves the recognition rate, particularly 
for extreme object views. For such views the hypotheses at a node of the test graph do 
not receive good support from its neighbours (problem of distortion in image regions). 
Moreover a large number of labels involved in the matching increases the entropy of 
labelling. When the number of candidate labels for a test node declines by virtue of 
model pruning the entropy of labelling diminishes. Consequently it is more likely for a 
test node to take its proper label (instead of the null label). 

Similar to label pruning, the grouping using the MNS method not only improves the 
recognition rate but also reduces the computational complexity of the entire recognition 
system. This experiment showed that the MNS-ARG method runs almost 
three times faster than the stand-alone ARG method. 

5 Conclusion 

We proposed a multiple classifier system approach to object recognition in computer 
vision. Multiple experts are used successively to prune the list of candidate hypotheses 
that have to be considered for object interpretation. The experts are organised in a serial 
architecture, with the later stages of the system dealing with a monotonically decreasing 
number of models. We developed a theoretical model which underpins this approach to 
object recognition and showed how it relates to various heuristic design strategies advocated 
in the literature. The merits of the advocated approach were then demonstrated on a two 
stage object recognition system. Experiments on the SOIL database showed worthwhile 
performance improvements, especially for object views far from the frontal, which was 
used for modelling. The improvements were achieved in spite of using a weak recogniser 
for the first (pruning) stage. The beneficial effects of different pruning strategies were 
demonstrated. 
Appendix: Probability of Suboptimal Decision Making 

In order to investigate the effect of estimation errors on decision making, let us examine 
the class a posteriori probabilities at a single point $\mathbf{x}$. Suppose the a posteriori probability 
of class $\omega_s$ is maximum, i.e. $P(\omega_s|\mathbf{x}) = \max_{i=1}^{m} P(\omega_i|\mathbf{x})$, giving the local Bayes error 
$e_B(\mathbf{x}) = 1 - P(\omega_s|\mathbf{x})$. However, our classifier only estimates these a posteriori class 
probabilities. The associated estimation errors may result in suboptimal decisions, and 
consequently in an additional recognition error. To quantify this additional error we 
have to establish what the probability is for the recognition system to make a suboptimal 
decision. This situation will occur when the a posteriori class probability estimate for 
one of the other model classes becomes maximum. Let us derive the probability $e_{si}(\mathbf{x})$ 
of the event occurring for class $\omega_i$, $i \neq s$, i.e. when 

$\hat{P}(\omega_i|\mathbf{x}) - \hat{P}(\omega_j|\mathbf{x}) > 0 \quad \forall j \neq i$ (9) 

Note the left hand side of (9) can be expressed as 

$P(\omega_i|\mathbf{x}) - P(\omega_j|\mathbf{x}) + e(\omega_i|\mathbf{x}) - e(\omega_j|\mathbf{x}) > 0$ (10) 

Equation (10) defines a constraint for the two estimation errors $e(\omega_k|\mathbf{x})$, $k = i,j$ as 

$e(\omega_i|\mathbf{x}) - e(\omega_j|\mathbf{x}) > P(\omega_j|\mathbf{x}) - P(\omega_i|\mathbf{x})$ (11) 

The event in (9) will occur when the estimate of the a posteriori probability of class 
$\omega_i$ exceeds the estimate for class $\omega_s$, while the other estimates of the a posteriori class 
probabilities $\omega_j$, $\forall j \neq i, s$ remain dominated by $\hat{P}(\omega_i|\mathbf{x})$. The first part of the condition 
will happen with the probability given by the integral of the distribution of the error 
difference in (11) under the tail defined by the margin $\Delta P_{si}(\mathbf{x}) = P(\omega_s|\mathbf{x}) - P(\omega_i|\mathbf{x})$. 
Let us denote this error difference by $\eta_{si}(\mathbf{x})$. Then the distribution of error difference 
$p[\eta_{si}(\mathbf{x})]$ will be given by the convolution of the error distribution functions $p_i[e(\omega_i|\mathbf{x})]$ 
and $p_s[e(\omega_s|\mathbf{x})]$, i.e. 

$p[\eta_{si}(\mathbf{x})] = \int_{-\infty}^{\infty} p_i[\eta_{si}(\mathbf{x}) + e(\omega_s|\mathbf{x})] \, p_s[e(\omega_s|\mathbf{x})] \, de(\omega_s|\mathbf{x})$ (12) 

Note that errors $e(\omega_r|\mathbf{x})$, $\forall r$ are subject to various constraints (i.e. $\sum_r e(\omega_r|\mathbf{x}) = 0$, 
$-P(\omega_r|\mathbf{x}) \leq e(\omega_r|\mathbf{x}) \leq 1 - P(\omega_r|\mathbf{x})$). We will make the assumption that the 
constraints are reflected in the error probability distributions themselves and therefore 
we do not need to take them into account elsewhere (i.e. integral limits, etc). However, 
the constraints also have implications on the validity of the assumptions about the error 
distributions in different parts of the measurement space. For instance in regions where 
all the classes are overlapping, the Gaussian assumption may hold but as we move to 
the parts of the space where the a posteriori model class probabilities are saturated, such 
an assumption would not be satisfied. At the same time, one would not expect 
any errors to arise in such regions and the breakdown of the assumption would not be 
critical. Returning to the event in (9), the probability of the first condition being true is 
given by $\int_{\Delta P_{si}(\mathbf{x})}^{\infty} p[\eta_{si}(\mathbf{x})] \, d\eta_{si}(\mathbf{x})$. 

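Referring to equation (11), for each $j$ the second condition will hold for $j \neq s,i$ 
with probability $\int_{-\infty}^{\Delta P_{ij}(\mathbf{x}) + e(\omega_i|\mathbf{x})} p_j[e(\omega_j|\mathbf{x})] \, de(\omega_j|\mathbf{x})$, with the exception of the last 
term, say $e(\omega_k|\mathbf{x})$, which is constrained by 

$e(\omega_k|\mathbf{x}) = -\sum_{j=1, j \neq k}^{m} e(\omega_j|\mathbf{x})$ (13) 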
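Thus, finally, the probability of assigning point $\mathbf{x}$ to model class $\omega_i$ instead of the Bayes 
optimal class $\omega_s$ will be given by 

$e_{si}(\mathbf{x}) = \int_{\Delta P_{si}(\mathbf{x})}^{\infty} p[\eta_{si}(\mathbf{x})] \, d\eta_{si}(\mathbf{x}) \cdot \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\Delta P_{ij}(\mathbf{x}) + e(\omega_i|\mathbf{x})} \cdots \int_{-\Delta P_{ik}(\mathbf{x}) - e(\omega_i|\mathbf{x}) - \sum_{t=1, t \neq k, t \neq l}^{m} e(\omega_t|\mathbf{x})}^{\Delta P_{il}(\mathbf{x}) + e(\omega_i|\mathbf{x})} p_j[e(\omega_j|\mathbf{x})] \, de(\omega_j|\mathbf{x}) \cdots \right] p_i[e(\omega_i|\mathbf{x})] \, de(\omega_i|\mathbf{x})$ (14) 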



and the total probability of label switching will be given by 

$e_s(\mathbf{x}) = \sum_{i=1, i \neq s}^{m} e_{si}(\mathbf{x})$ (15) 




Coaxial Omnidirectional Stereopsis 



Libor Spacek 



Department of Computer Science 
University of Essex 
Colchester, CO4 3SQ, UK 
spacl@essex.ac.uk, 
http://cswww.essex.ac.uk/mv 



Abstract. Catadioptric omnidirectional sensors, consisting of a cam- 
era and a mirror, can track objects even when their bearings change 
suddenly, usually due to the observer making a significant turn. There 
has been much debate concerning the relative merits of several possible 
shapes of mirrors to be used by such sensors. 

This paper suggests that the conical mirror has some advantages over 
other shapes of mirrors. In particular, the projection beam from the cen- 
tral region of the image is reflected and distributed towards the horizon 
rather than back at the camera. Therefore a significant portion of the 
image resolution is not wasted. 

A perspective projection unwarping of the conical mirror images is devel- 
oped and demonstrated. This has hitherto been considered possible only 
with mirrors that possess single viewpoint geometry. The cone is viewed 
by a camera placed some distance away from the tip. Such an arrangement 
does not have single viewpoint geometry. However, its multiple viewpoints 
are shown to be dimensionally separable. 

Once stereopsis has been solved, it is possible to project the points of 
interest to a new image through a (virtual) single viewpoint. Successful 
reconstruction of a single viewpoint image from a pair of images obtained 
via multiple viewpoints appears to validate the use of multiple viewpoint 
projections. 

The omnidirectional stereo uses two catadioptric sensors. Each sensor 
consists of one conical mirror and one perspective camera. The sensors 
are in a coaxial arrangement along the vertical axis, facing up or down. 
This stereoscopic arrangement leads to very simple matching since the 
epipolar lines are the radial lines of identical orientations in both omni- 
directional images. 

The stereopsis results on artificially generated scenes with known ground 
truth show that the error in computed distance is proportional to the 
distance of the object (as usual), plus the distance of the camera from 
the mirror. The error is also inversely proportional to the image radius 
coordinate, i.e. the results are more accurate for points imaged nearer the 
rim of the circular mirror. 



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3024, pp. 354-365, 2004. 
1 Introduction 

Autonomous navigation, site modelling and surveillance applications all benefit 
from using panoramic 360° images. Omnidirectional visual sensors produce such 
images. Early attempts at using omnidirectional sensors included camera clus- 
ters [1] and various arrangements of mechanically rotating cameras and planar 
mirrors, [2], [3], [4]. These mostly had problems with registration, motion, or 
both. Fisheye lens cameras have also been used to increase the field of view [5] 
but they proved difficult because of their irreversible distortion of nearby objects 
and the lack of a single viewpoint, explained below. 

Catadioptric sensors [6] consist of a fixed dioptric camera, usually mounted 
vertically, plus a fixed rotationally symmetrical mirror above or below the cam- 
era. The advantages of catadioptric sensors derive from the fact that, unlike 
the rotating cameras, their ‘scanning’ of the surroundings is more or less 
instantaneous. (The camera exposure time is usually shorter than the full rotation 
time.) Shorter exposure means fewer image capture problems caused by motion 
and vibration of the camera, or by moving objects. 

The suitability for use in dynamic environments is clearly an important con- 
sideration, especially as one of the chief benefits of omnidirectional vision in 
general is the ability to retain objects in view even when their bearings have 
changed significantly. Catadioptric omnidirectional sensors are ideally suited to 
visual navigation [7], visual guidance applications [8], using stereopsis, motion 
analysis [9], and site mapping [10]. 

The problem with catadioptric sensors is that the details of the image can 
have relatively poor resolution, as the image depicts a large area. The resolution 
problem is unfortunately compounded by mirrors whose shapes have curved 
cross-sections. Such radially curved mirrors include the three quadric surface 
mirrors (elliptic, hyperbolic and parabolic) which are known to possess a single 
viewpoint at their focal points. 

Single viewpoint projection geometry exists when the light rays arriving from 
all directions intersect at a single point known as the (single) effective viewpoint. 
For example, by placing the centre of the perspective camera lens at the outer 
focus of the hyperbolic mirror, the inner focus then becomes the single effective 
viewpoint. 

A single viewpoint is generally thought to be necessary for an accurate un- 
warping of images and for an accurate perspective projection which is relied on 
by most computer vision methods [11]. 

The single viewpoint projection has been endorsed and recommended by [12], 
[13], [14], [15], [16] and others. 

There have been few attempts at analysing non-single viewpoint sensors [17], 
[18], although various people [19] used them previously without analysis. 

The resolution of omnidirectional sensors can be improved by using several 
planar mirrors with a separate camera for each one of them. The mirrors are 
placed in some spatial arrangement, for instance in a six sided pyramid [20]. The 
reflected camera positions are carefully aligned to coincide and to form a single 
effective viewpoint. However, such arrangements are awkward, expensive, and 
sensitive to alignment errors. The hexagonal pyramid apparatus would require 
no fewer than twelve cameras for stereopsis! Also, the coverage of the surrounding 
area is not isotropic. 

Spacek [21] proposed a solution to the above problems which combines the 
benefits of the planar mirrors (no radial distortion, no radial loss of resolution) 
with the advantages of the rotationally symmetric catadioptric sensor (short 
exposure, isotropic imaging). The only shape of mirror that satisfies these re- 
quirements is the cone. 



2 Perspective Projection through a Conical Mirror 

The benefits of the cone mirror over the radially curved mirrors were pointed 
out by Lin and Bajcsy in [22]. They can be summarised as: 

1. Curved cross-section mirrors produce inevitable radial distortions. Radial dis- 
tortion is proportional to the radial curvature of the mirror. The cone has zero 
radial curvature everywhere except at its tip, which is only reflecting the camera 
anyway. 

2. Radially curved mirrors produce ‘fish eye’ effects: they magnify the objects 
reflected in the centre of the mirror, typically the camera, the robot, or the sky, 
all of which are of minimal interest. On the other hand, they shrink the region 
around the horizon, thereby reducing the available spatial resolution in the area 
which is of interest. See Figures 1 and 2 for the comparison of the hyperbolic 
and the conical mirrors. The mirrors are showing different scenes but both are 
pointing vertically upwards. 

3. The cone presents planar mirrors in cross-section. See Figure 3. The planar 
mirror does not have a complicated function mapping the camera resolution 
density onto the real world. 

Some optimised shapes of radially curved mirrors have been proposed [23], as 
well as hybrid sensors, mirrors combining two shapes into one, and other mirrors 
of various functions. However, it seems that none of them completely addresses all 
of the above points. 

The cone mirror has a single effective viewpoint located at the tip. Lin and 
Bajcsy proposed cutting off the tip and placing the camera lens in its place, or 
placing the tip at the forward focus point of the lens. Both of these methods 
require the camera to be very close to the mirror, which results in difficulties 
with capturing enough light and with focusing, so the improvement in image 
quality over the curved mirrors is debatable. 

Our solution consists of placing the camera at a comfortable distance d from 
the tip of the conical mirror and still obtaining a useful projection, despite the 
fact that there is now an infinite number of viewpoints arranged in a circle of 
radius d around the tip of the cone. See Figure 4. Not having to fix the camera at 
a precise distance represents an additional practical benefit in comparison with 
the hyperbolic mirrors or the approach of Lin and Bajcsy. 

R is both the radius and the height of the cone with a 90° angle at the 
tip. Given the field of view angle $\phi$ of a particular camera lens, the appropriate 
Fig. 1. An omnidirectional image obtained 
using a hyperbolic mirror and an 
ordinary perspective camera. Note the 
typical predominance of the sky. 
Fig. 2. A conical mirror image of an indoors 
scene. The entire mirror image reflects 
useful data. 




Fig. 3. Cross section of the conical mirror 
projection geometry. According to the 
laws of optics, mirrors can reflect either 
the objects or the viewpoints. The two 
situations are equivalent. In this case, the 
real camera with a field of view $\phi$ is reflected 
in two planar mirrors, creating 
two effective viewpoints. Each viewpoint 
has a field of view $\phi/2$ between its central 
projection ray and its extreme ray. The 
angle at the tip of the cone is $\alpha = 90°$ 
to ensure that the two effective lines of 
sight (central rays 1&2) are oriented directly 
towards each other. R is the radius 
of the mirror. 



Fig. 4. Top view of the perspective pro- 
jection of P via the conical mirror. The 
circle of radius d is the locus of the view- 
points of the distant camera. 




Fig. 5. Cross section of the perspective 
projection of P via the conical mirror: the 
image is distance v behind the centre of 
the lens. 
camera distance is $d = R(\cot(\phi/2) - 1)$. This is the camera distance calculated to 
inscribe the base circle of the cone within the image. The field of view of the 
camera and the size of the mirror are thus utilised to their best advantage. For 
example, a mirror of radius $R = 60$mm and a camera with $\phi = \pi/4$ result in 
$d = 85$mm (rounded up). The distance $d$ is critical for the focal mirrors but not so for the 
cone. At worst, we may lose a few pixels around the edges of the image. 
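As a quick numerical check of the camera distance, the sketch below evaluates d = R(cot(φ/2) − 1) for the example values quoted in the text. The function name `camera_distance` is hypothetical.

```python
import math


def camera_distance(mirror_radius, field_of_view):
    """Distance d from the cone tip at which the base circle of a
    90-degree cone of radius R just fills the camera field of view.

    mirror_radius: R, in any length unit (the result uses the same unit).
    field_of_view: full field of view angle phi, in radians.
    """
    return mirror_radius * (1.0 / math.tan(field_of_view / 2.0) - 1.0)


# Example from the text: R = 60 mm, phi = pi/4.
d = camera_distance(60.0, math.pi / 4.0)
print(f"d = {d:.2f} mm")
```

The value obtained is just under 85 mm, in agreement with the rounded figure given above.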



2.1 The Projections 



The image of a rotationally symmetric mirror viewed along its axis of symmetry 
is circular. It is therefore convenient to use the polar coordinates $(r_i, \theta)$ to 
represent the image positions and the related cylindrical coordinates $(r, \theta, h)$ for 
the 3D scene. See Figure 5 for the $\theta$ cross section of the conical mirror and the 
associated perspective projection. Note that the points of interest along the projection 
ray from a 3D scene point $P(r, \theta, h)$ are collinear (forming three similar 
triangles). 

Let the image radius coordinate of the projected point P have the value 
$h_i$ (the image height of P). The perspective projection formula is obtained from 
the collinearity property (or two similar triangles) in Figure 5: 

$h_i = \frac{vh}{d + r}$ (1) 



$h_i$ values are always positive (image radius). This is equivalent to using front 
projection to remove the image reversal. Equation (1) is much simpler than the 
projection equations for the radially curved mirrors. 

$v$ is the distance of the image plane behind the centre of the thin lens in 
Gaussian optics. The focal length is normally less than $v$, unless we reduce 
$v$ to focus on infinity, or use the simplifying pinhole camera assumption. The 
calibration of $v$ is obtained by substituting the image radius of the mirror $r_m$ for 
$h_i$, and $R$ for $h$ and $r$ in equation (1), giving: $v = r_m(\frac{d}{R} + 1)$. The image radius 
$r_m$ is determined by locating the outer contour of the mirror in the image. 

The classic perspective projection function for the single effective viewpoint 
at (0,0,0) is just a special case of equation (1), where d = 0. Suppose we cre- 
ate a thought-experiment (Gedanken) world in which all the objects are pushed 
distance d further away from the mirror axis. Then the single viewpoint pro- 
jection of the Gedanken world would result in the same image as the multiple 
viewpoint projection of the real world. It is also clear that once r is known (see 
the stereopsis method below), it is possible to reconstruct the single viewpoint 
projection of the real world by using equation (1) and setting d = 0. 
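The projection, its calibration, and the reconstruction of a single viewpoint projection can be sketched as follows, assuming equation (1) in the form h_i = v·h/(d + r) and the calibration v = r_m(d/R + 1). The function names are hypothetical, and the numeric values in the usage lines are made up for illustration.

```python
def calibrate_v(r_m, R, d):
    """Image plane distance v from the mirror rim radius r_m seen in the
    image (pixels), the mirror radius R and the camera distance d."""
    return r_m * (d / R + 1.0)


def project_radius(h, r, v, d):
    """Equation (1): image radius h_i of a scene point at height h and
    radial distance r, seen via the conical mirror."""
    return v * h / (d + r)


def single_viewpoint_radius(h_i, r, v, d):
    """Re-project an image radius to the single viewpoint case (d = 0),
    which requires the radial distance r to be known already."""
    h = h_i * (d + r) / v      # invert equation (1) to recover h
    return v * h / r           # equation (1) with d = 0


# Illustrative values only: rim at 200 px, R = 60 mm, d = 85 mm.
v = calibrate_v(200.0, 60.0, 85.0)
h_i = project_radius(300.0, 1000.0, v, 85.0)
print(h_i, single_viewpoint_radius(h_i, 1000.0, v, 85.0))
```

A point on the mirror rim (h = r = R) projects back to r_m under this calibration, which is a convenient sanity check.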

The geometry is illustrated in Figure 4. The outer circle depicts the projection 
cylinder with the radius d + v and the same axis as the cone. The projection 
cylinder for the single viewpoint at (0,0,0) is similar but has the radius v (the 
innermost circle). 

So far, we considered the projection for a fixed value of $\theta$ and identified its 
associated viewpoint. Now we fix the elevation angle $\epsilon = \arctan(h/(d + r))$ and 
allow $\theta$ to vary. Imagine spinning Figure 5 around the mirror axis. All projection 
lines with the same elevation angle will intersect the cone axis at the single point 
$C(0, \theta, h_c)$. Thus the intersection point C is the viewpoint associated with the 
elevation $\epsilon$. We can determine the height $h_c$ of C from the height $h$ of P by again 
using the collinearity property: $h_c = (d \cdot h)/(d + r)$. 

Sensors with a single (global) effective viewpoint have the same perspective 
projection in both orthogonal image dimensions (usually x,y). However, we get 
a different perspective projection in the $\theta$ dimension, as the effective viewpoint 
C for the $\theta$ dimension is different from the effective viewpoint $(d, \theta + \pi, 0)$ for 
the radial projection. 

The $\theta$ projection is not needed for our stereopsis, which uses only the radial projection, 
but it could be utilised if we placed two mirrors side-by-side. It has been 
used in this fashion in [24]. 

We now define the projection property whereby the viewpoints are said to 
be dimensionally separable: 

- Each radial line in the image (or equivalently each column in the unwarped 
image) has its own unique viewpoint. 

- Each concentric circle in the image (or equivalently each row in the unwarped 
image) has its own unique viewpoint. 

- Each pixel is aligned with its two (row and column) viewpoints, along the 
projection line from P. 

2.2 Registration 

We have just described the idealised projection which will be valid and accurate 
after registration, when the tip of the mirror is precisely aligned with the centre 
of the image and the axis of view coincides with the axis of the mirror. In general, 
registration needs to be performed to find the two translation and three rotation 
parameters needed to guarantee this. Existing registration methods will also 
apply and work in this situation. See [25] and [26] for good solutions to this 
problem within the context of omnidirectional vision. 

Straight lines in the 3D world become generally conic section curves when 
projected. However, lines which are coplanar with the axis of the mirror project 
into radial lines. Concentric circles around the mirror project again into concen- 
tric circles. These properties can be utilised for a simple test card registration 
method, where the test card is of the ‘shooting target’ type consisting of cross- 
hairs and concentric circles, centered on the cone axis. 



2.3 Unwarping of the Input Image 



If we were to cut and unroll the virtual projection cylinder, we would get the 
unwarped rectangular panoramic image. Therefore unwarping is the backprojection 
of the real input image onto the virtual projection cylinder. The unwarping 
from the polar coordinates $(h_i, \theta_i)$ of the input image into the $(x, y)$ coordinates 
of the rectangular panoramic image is: 

$x = \frac{x_m}{2\pi} \cdot \theta_i, \quad y = \frac{y_m}{r_m} \cdot h_i$ (2) 

where $(x_m, y_m)$ are the desired dimensions of the unwarped image, $r_m$ is the 
radius of the mirror as seen in the input image, and $\theta_i$ is measured in radians. 
The aspect ratio of the panoramic image is: $x_m/y_m = 2\pi$. 

The direct mapping from the pixel position $(x, y)$ of the panoramic unwarped 
image to the corresponding position $(x_i, y_i)$ of the input image is presented next. 
We use polar coordinates as an intermediate step, and then equation (2). We also 
need to know the centre of the mirror in the input image $(x_c, y_c)$. 

$x_i = x_c + h_i \cdot \cos\theta_i = x_c + \frac{r_m}{y_m} \cdot y \cdot \cos(\frac{2\pi}{x_m} \cdot x)$ (3) 

$y_i = y_c + h_i \cdot \sin\theta_i = y_c + \frac{r_m}{y_m} \cdot y \cdot \sin(\frac{2\pi}{x_m} \cdot x)$ (4) 
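The mapping from a panoramic pixel back to the input image can be sketched as below, assuming the unwarping has the form x = (x_m/2π)·θ_i and y = (y_m/r_m)·h_i, so that h_i = (r_m/y_m)·y and θ_i = (2π/x_m)·x. The function name `panorama_to_input` and the example values are hypothetical. The sketch only returns real-valued input coordinates and leaves the sampling method open.

```python
import math


def panorama_to_input(x, y, x_m, y_m, r_m, x_c, y_c):
    """Map panoramic pixel (x, y) to input-image coordinates (x_i, y_i).

    x_m, y_m: dimensions of the unwarped panoramic image.
    r_m:      radius of the mirror as seen in the input image.
    x_c, y_c: centre of the mirror in the input image.
    """
    h_i = (r_m / y_m) * y                   # image radius
    theta_i = (2.0 * math.pi / x_m) * x     # polar angle in radians
    x_i = x_c + h_i * math.cos(theta_i)
    y_i = y_c + h_i * math.sin(theta_i)
    return x_i, y_i


# Illustrative values: 200 px mirror radius centred at (320, 240).
print(panorama_to_input(0.0, 100.0, 628.0, 100.0, 200.0, 320.0, 240.0))
```

The bottom row of the panorama (y = y_m) maps onto the rim of the mirror, and the top row (y = 0) collapses onto the mirror centre, which is why the resolution near the tip is limited.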

We used the unwarping by two dimensional DCT (discrete cosine transform) 
of the omnidirectional input image, as described in [21], instead of the usual but 
less precise pixel interpolation methods. The main advantage of this approach 
becomes apparent when performing the radial edge-finding needed for our stereo 
(see the next section). 

See Figure 6 for the unwarping applied to a hyperbolic mirror image and 
Figure 7 for the unwarping of a conical mirror image. Note that the conical 
mirror image makes better use of the available vertical resolution of the image. This 
provides better resolution for stereopsis, though the resolution near the tip of 
the mirror is clearly limited. 




Fig. 6. Unwarping of Figure 1. 




Fig. 7. Unwarping of Figure 2. 
3 Coaxial Stereopsis 



Various arrangements have been proposed for binocular systems using catadiop- 
tric sensors. Two mirrors situated side by side can be used to compute the 
distance of objects in terms of the disparity measured as the arising difference 
in angles $\theta$ [24]. However, such an arrangement is not truly omnidirectional, as a 
large part of the scene will be obstructed by the other catadioptric sensor. 

It is better to arrange the cameras coaxially to avoid this problem. The coaxial 
arrangement has the further major advantage of having simple aligned radial 
epipolar lines. Lin and Bajcsy [27] used a single conical mirror and attempted to 
place two cameras at different distances along its axis. They had to use a beam- 
splitter to avoid the nearer camera obstructing the view of the more distant 
camera. See Figure 8. We propose an omnidirectional stereo system consisting 




[Fig. 8 diagram labels: Beam splitter; Apparent viewing position of Camera 1; Camera 2] 




Fig. 8. Lin and Bajcsy’s omnidirectional 
stereo using a single conical mirror and 
two cameras at different distances. The 
beam splitter avoids an obstruction of 
the second camera’s view but reduces the 
amount of available light. 



Fig. 9. Omnidirectional stereo using two 
coaxial mirrors. 



of two coaxial conical mirrors pointing in the same direction, each with its own 
camera. See Figure 9. 

We wish to obtain a triangulation formula for the radial distance of objects 
$r$. The radial distance is measured from the axis of the mirror(s) to any 3D 
scene point P, which has to be in the region that is visible by both cameras 
(the common region). See Figure 9. The common region is annular in shape in 
3D, with a triangular cross-section extending to infinity. It is bounded above 
and below in the $(r, h)$ plane by the lines: $h = \frac{R(d + r)}{d + R}$ and $h = s$. The angle 
at the tip of the common region triangle is $\phi/2$. The distance of the tip is: 
$r_{min} = s(\frac{d}{R} + 1) - d$. Stereopsis cannot be employed anywhere nearer than $r_{min}$. 

In order to obtain the triangulation formula, we subtracted two instances of 
equation (1) for two coaxial mirrors separated by distance s (s is measured along 
the h axis). We assume here that the parameters v and d are the same for both 
cameras, though this assumption can be easily relaxed if necessary. 

$r = \frac{vs}{h_{i2} - h_{i1}} - d$ (5) 

This is very similar to the usual triangulation formula from classical side-by-side 
stereopsis but here the disparity, $h_{i2} - h_{i1}$, is the radial disparity. The extra 
distance $d$ is correctly subtracted. The similarity of the formulae is not surprising, 
as the two reflected (virtual) cameras resemble a classical side-by-side system 
within the plane of orientation $\theta$. 
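The triangulation can be checked numerically with the short sketch below. It assumes the projection h_i = v·h/(d + r) for each mirror, with the second mirror displaced by s along the h axis, so that the two image radii differ by v·s/(d + r). The function name `radial_distance`, the use of a positive disparity magnitude, and the numeric values are all assumptions made for illustration.

```python
def radial_distance(disparity, v, s, d):
    """Radial distance r of a scene point from the mirror axis.

    disparity: positive difference of the two image radii of the point.
    v: image plane distance, s: mirror separation, d: camera distance.
    """
    if disparity <= 0.0:
        raise ValueError("disparity must be positive")
    return v * s / disparity - d


# Round trip with made-up values: a point at r = 1000, h = 300.
v, s, d = 480.0, 100.0, 85.0
r, h = 1000.0, 300.0
h_a = v * h / (d + r)          # image radius in the first image
h_b = v * (h - s) / (d + r)    # image radius in the displaced mirror
print(radial_distance(h_a - h_b, v, s, d))
```

Because the disparity shrinks as 1/(d + r), a fixed error in the measured image radii produces a distance error that grows with both r and d, in line with the behaviour reported in the abstract.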




Fig. 10. Edge map of the unwarped image in Figure 7. 



3.1 Radial Edge Finding and Matching 

The radial epipolar matching is driven by edges whose gradient is primarily in 
the radial direction. We find those by the radial edge-finder using the DCT and 
the polar coordinates $(h_i, \theta_i)$ of the input image, as described in [21]. The main 
benefit of this approach is that the slow unwarping process is avoided. We also 
obtain the partial derivatives of the image function in the $h_i$ and $\theta_i$ directions, which 
is going to be useful for a polar optic flow. 

The unwarping is needed only for the convenience of human viewing, such as 
in Figure 10, showing a traditional edge map of the unwarped image, using [28]. 
The stereopsis correspondence computation is driven primarily by the horizontal 
edges in this example. 

The radial edge finding consists of the following steps: 

1. Perform the forward DCT on the omnidirectional input image, using 
(x_i, y_i) coordinates. 

2. Convert the input image coordinates at which the radial gradient component 
is to be computed from the rectangular form (x_i, y_i) to the polar form 
(h_i, θ_i) and substitute them into the normal inverse DCT function. 

3. Apply the radial edge function (ref) defined in [21]. This function was obtained 
by partial differentiation of the inverse DCT in polar coordinates with 
respect to h_i. 

In other words, we are differentiating the inverse transform function instead 
of differentiating the image. This is legitimate as the DCT has a finite number 
of terms. The output is the desired radial edge map in the same format as the 
original input image, i.e. it is the radial edge map of the circular mirror in the 
rectangular image coordinates (x_i, y_i). A similar process can be followed to 
find the partial derivatives of an image in the θ direction, or higher derivatives. 

The radial edge finder should be of interest to omnidirectional vision 
generally, as it can be used with any rotationally symmetric mirror. The 
unwarping is unnecessary when using autonomous vision methods that work in 
polar coordinates. 
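
The idea of differentiating the inverse transform rather than the image can be 
sketched as follows. This is not the ref of [21], whose exact form is not given 
here. It is a generic illustration that assumes an orthonormal DCT-II on a small 
square image and a known mirror centre; all function names are hypothetical. 

```python
import math

def cos_terms(pos, size):
    """Orthonormal DCT-II basis functions and their derivatives at position pos."""
    values, derivs = [], []
    for k in range(size):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / size)
        phase = math.pi * (2.0 * pos + 1.0) * k / (2.0 * size)
        values.append(scale * math.cos(phase))
        derivs.append(-scale * (math.pi * k / size) * math.sin(phase))
    return values, derivs

def forward_dct2(image):
    """2D DCT coefficients of a small square image given as a list of rows."""
    size = len(image)
    basis = [cos_terms(n, size)[0] for n in range(size)]  # basis[n][k]
    return [[sum(basis[y][ky] * basis[x][kx] * image[y][x]
                 for y in range(size) for x in range(size))
             for kx in range(size)] for ky in range(size)]

def image_and_radial_gradient(coeffs, centre, h, theta):
    """Inverse DCT value and its derivative with respect to h at polar (h, theta)."""
    size = len(coeffs)
    x = centre[0] + h * math.cos(theta)
    y = centre[1] + h * math.sin(theta)
    cx, dcx = cos_terms(x, size)
    cy, dcy = cos_terms(y, size)
    value = d_dx = d_dy = 0.0
    for ky in range(size):
        for kx in range(size):
            c = coeffs[ky][kx]
            value += c * cy[ky] * cx[kx]
            d_dx += c * cy[ky] * dcx[kx]
            d_dy += c * dcy[ky] * cx[kx]
    # Chain rule for x = x0 + h cos(theta), y = y0 + h sin(theta).
    return value, d_dx * math.cos(theta) + d_dy * math.sin(theta)
```

Because the inverse transform is evaluated at continuous coordinates, the 
gradient can be sampled at any sub-pixel position along a radial line without 
unwarping the image first. 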

It is not necessary to generate the entire rectangular edge map of the second 
image when doing the stereo matching. The ref can be evaluated at any randomly 
selected points with sub-pixel accuracy. The outline of the radial stereo matching 
algorithm is as follows: 

1. Given a pair of stereo images f_1 and f_2, find all feature points in f_1 
where abs(ref) is significant (abs() is the absolute value function). 

2. Find the θ_i value of the selected feature point. 

3. Keeping θ_i fixed, evaluate ref along the epipolar radial line in f_2 and 
store the image gradient vectors for both epipolar lines in two buffers. 

4. Match the buffers, looking for similar values of the gradient vectors and 
paying attention to sensible ordering of the matches, plus any other stereopsis 
matching tricks. 

5. Compute the distance of objects for all successful matches, using the matched 
radial position values h_{i1} and h_{i2} and the triangulation equation (5). 

6. Move to the next value of θ_i which has significant image feature(s) and 
repeat from 2. 

There are other sophisticated stereo matching methods that could be adapted 
to these circumstances, for example [29]. 
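
The buffer matching step of the outline above can be sketched as follows. This 
is a deliberately simple greedy matcher, not the method used for the reported 
results. It assumes each buffer holds scalar radial gradient values sampled 
along the same θ_i, and the function name and tolerance parameter are 
hypothetical. 

```python
def match_radial_buffers(buffer1, buffer2, tolerance):
    """Match gradient samples along one epipolar radial line in two images.

    Each buffer is a list of (h, gradient) pairs sorted by radial position h.
    Returns a list of (h_i1, h_i2) pairs. Matches are forced to keep the same
    radial order in both images (the ordering constraint).
    """
    matches = []
    start = 0  # first index in buffer2 still allowed by the ordering constraint
    for h1, g1 in buffer1:
        best_index, best_diff = None, tolerance
        for j in range(start, len(buffer2)):
            diff = abs(buffer2[j][1] - g1)
            if diff < best_diff:
                best_index, best_diff = j, diff
        if best_index is not None:
            matches.append((h1, buffer2[best_index][0]))
            start = best_index + 1
    return matches
```

Each returned pair of radial positions can then be passed to the triangulation 
equation (5). A dynamic programming matcher would be a more robust choice than 
this greedy pass. 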

3.2 Stereopsis Discussion and Results 

In the illustrated arrangement the view is directed at the horizon, which is nor- 
mally rich in natural visual features of high contrast that are useful for outdoors 
navigation [7]. For closer visual guidance indoors, the entire apparatus can be 
simply inverted. The visible regions will lie either above or below the horizontal 
plane touching the tip of the mirror, respectively. For very close range stereo, 
we suggest inverting just the top mirror and camera, so that the tips of the 
mirrors are facing away from each other. In each case the following triangulation 
and matching will be much the same. The only combination to be avoided for 
stereopsis is the one with the tips of the mirrors facing each other, as this would 
result in no common region visible by both cameras. 

Our arrangement is quite different from that proposed by Lin and Bajcsy. 
The resulting triangulation formula is different. Our system is simpler, there is 
no loss of light through the beam splitter, and we gain better image quality by 
being able to view large conical mirrors. 

Lin and Bajcsy did not specify a 90° angle at the tip of their mirror, so our 
perspective projection, as described in section 2, has different specific properties. 

We have tested our stereopsis method on artificial images with known ground 
truth (admittedly not as demanding a test as using real images) and found the 
errors in r to grow linearly with r + d. The errors are a function of the image 
resolution, so for a fixed r, they are inversely proportional to h. This means that 
the errors are smaller for points imaged nearer the edge of the mirror, where 
the θ resolution is better. There is good agreement between our results and a 
theoretical error prediction based on differentiation of the perspective projection 
formula. 

4 Conclusion 

This paper has identified the conical mirror as a good solution for catadioptric 
omnidirectional sensors. 

The benefits of conical mirrors had been hitherto mostly overlooked because 
of the demands for a single viewpoint projection. We conclude that the single 
viewpoint is not necessary for an accurate perspective projection when using 
the conical mirror with a 90° angle at the tip. Such conical mirrors provide 
a useful model of projection when viewed from any reasonable distance by an 
ordinary perspective camera. Conical mirrors are less sensitive to the precise 
distance of the camera than are hyperbolic and elliptic mirrors. The ability to 
view the mirror from a greater distance is desirable since it allows the use of larger 
mirrors with relatively better optical quality. Given the same physical surface 
quality (roughness), the optical quality will be proportional to the dimensions of 
the mirror. The radial distortion properties of conical mirrors are better when 
compared to other circular mirrors. Last but not least, conical mirrors direct the 
camera resolution into more useful parts of the surroundings and their resolution 
density is well behaved. 

The unwarping methods and experiments demonstrated the concept of an 
accurate perspective projection via multiple viewpoints. 

The benefits of the coaxial omnidirectional stereo system are both practi- 
cal (objects do not disappear from view due to vehicle rotation), and theoreti- 
cal/computational (the epipolar geometry is simpler than in classical stereopsis). 

References 

1. Swaminathan, R., Nayar, S.K.: Nonmetric calibration of wide-angle lenses and 
polycameras. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 
(2000) 1172-1178 

2. Rees, D.: Panoramic television viewing system. US Patent No. 3,505,465 (1970) 

3. Kang, S., Szeliski, R.: 3-d scene data recovery using omnidirectional multibaseline 
stereo. IJCV 25 (1997) 167-183 

4. Ishiguro, H., Yamamoto, M., Tsuji, S.: Omni-directional stereo. PAMI 14 (1992) 
257-262 

5. Shah, S., Aggarwal, J.: Mobile robot navigation and scene modeling using stereo 
fish-eye lens system. MVA 10 (1997) 159-173 

6. Nayar, S.: Catadioptric omnidirectional cameras. In: CVPR97. (1997) 482-488 

7. Rushant, K., Spacek, L.: An autonomous vehicle navigation system using 
panoramic vision techniques. In: International Symposium on Intelligent Robotic 
Systems, ISIRS98. (1998) 275-282 

8. Pajdla, T., Hlavac, V.: Zero phase representation of panoramic images for image 
based localization. In: Computer Analysis of Images and Patterns. (1999) 550-557 







9. Yagi, Y., Nishii, W., Yamazawa, K., Yachida, M.: Rolling motion estimation for 
mobile robot by using omnidirectional image sensor hyperomnivision. In: ICPR96. 
(1996) 

10. Yagi, Y., Nishizawa, Y., Yachida, M.: Map-based navigation for a mobile robot 
with omnidirectional image sensor copis. Trans. Robotics and Automation 11 
(1995) 634-648 

11. Baker, S., Nayar, S.: A theory of single-viewpoint catadioptric image formation. 
IJCV 32 (1999) 175-196 

12. Baker, S., Nayar, S.: A theory of catadioptric image formation. In: ICCV98. (1998) 
35-42 

13. Geyer, C., Daniilidis, K.: A unifying theory for central panoramic systems and 
practical applications. In: ECCV00. (2000) 

14. Geyer, C., Daniilidis, K.: Properties of the catadioptric fundamental matrix. In: 
ECCV02. Volume 2. (2002) 140 ff. 

15. Baker, S., Nayar, S.: Single viewpoint catadioptric cameras. In: PV01. (2001) 
39-71 

16. Svoboda, T., Pajdla, T.: Epipolar geometry for central catadioptric cameras. IJCV 
49 (2002) 23-37 

17. Swaminathan, R., Grossberg, M., Nayar, S.: Caustics of catadioptric cameras. In: 
ICCV01. (2001) 

18. Fiala, M., Basu, A.: Panoramic stereo reconstruction using non-svp optics. In: 
ICPR02. Volume 4. (2002) 27-30 

19. Yagi, Y., Kawato, S.: Panoramic scene analysis with conic projection. In: IROS90. 
(1990) 

20. Yokoya, N., Iwasa, H., Yamazawa, K., Kawanishi, T., Takemura, H.: Generation of 
high-resolution stereo panoramic images by omnidirectional imaging sensor using 
hexagonal pyramidal mirrors. In: ICPR98. (1998) 

21. Spacek, L.: Omnidirectional catadioptric vision sensor with conical mirrors. In: 
Towards Intelligent Mobile Robotics, TIMR03. (2003) 

22. Lin, S., Bajcsy, R.: True single view point cone mirror omni-directional catadioptric 
system. In: ICCV01. Volume 2. (2001) 102-107 

23. Hicks, A., Bajcsy, R.: Reflective surfaces as computational sensors. In: The 
second IEEE Workshop on Perception for Mobile Agents. Held in Conjunction with 
CVPR’99. (1999) 82-86 

24. Brassart, E., et al.: Experimental results got with the omnidirectional vision sensor: 
Syclop. In: IEEE Workshop on Omnidirectional Vision (OMNIVIS’00). (2000) 
145-152 

25. Geyer, C., Daniilidis, K.: Structure and motion from uncalibrated catadioptric 
views. In: CVPR01. Volume 1. (2001) 279-286 

26. Geyer, C., Daniilidis, K.: Paracatadioptric camera calibration. IEEE PAMI 24 
(2002) 1-10 

27. Lin, S., Bajcsy, R.: High resolution catadioptric omni-directional stereo sensor for 
robot vision. In: IEEE International Conference on Robotics and Automation, 
Taipei, Taiwan. (2003) 12-17 

28. Spacek, L.: Edge detection and motion detection. Image and Vision Computing 4 
(1986) 43-56 

29. Sara, R.: Finding the largest unambiguous component of stereo matching. In: 
ECCV (3). (2002) 900-914 




Classifying Materials from Their Reflectance Properties 



Peter Nillius and Jan-Olof Eklundh 



Computational Vision & Active Perception Laboratory (CVAP) 
Department of Numerical Analysis and Computer Science 
Royal Institute of Technology (KTH), S-100 44 Stockholm, Sweden 
{nillius, joe}@nada.kth.se 



Abstract. We explore the possibility of recognizing the surface material 
from a single image with unknown illumination, given the shape of the 
surface. 

Model-based PCA is used to create a low-dimensional basis to represent 
the images. Variations in the illumination create manifolds in the space 
spanned by this basis. These manifolds are learnt using captured illumi- 
nation maps and the CUReT database. Classification of the material is 
done by finding the manifold closest to the point representing the image 
of the material. 

Testing on synthetic data shows that the problem is hard. The materials 
form groups in which a material is often misclassified as one of the other 
materials in its group. With a grouping algorithm we find 
a grouping of the materials in the CUReT database. Tests on images of 
real materials in natural illumination settings show promising results. 



1 Introduction 

The appearance of a surface depends on its shape, the illumination and the 
material of the surface. In a normal vision task none of these properties are 
known a priori. Despite that, human observers are very good at determining the 
material of an object, even in the absence of texture. The estimation is done 
purely based on the reflectance properties of the surface. We will explore if this 
can be done computationally when there is no knowledge about the illumination, 
but the shape of the object is known. 

Recent research [2,11,10,8] has shown that the reflected light from a 
Lambertian surface can be represented with a low-dimensional model although the 
variations in illumination are infinite. This is because the surface acts as a 
low-pass filter on the incident illumination, making the images in practice lie 
in a low-dimensional subspace. Other work [12,9] indicates that this holds for 
many other types of surface reflectance, e.g. many of the materials in the CUReT 
database [3]. 

In this paper we classify the material of an object of known shape from a 
single image, when the illumination is unknown. Dror et al. [4] recognize 
materials under similar assumptions. They use histograms of filter responses 
and rely on the structure of the specular reflections to classify the material. Our 
approach is different in that we represent the images using a generative model, 
allowing us to discriminate between materials without specular reflections such 
as felt and velvet. 

2 A Low-Dimensional Generative Model for Image 
Irradiance 

To find a basis to represent the images, we use the framework described in [9]. 
With this framework we can, for a given shape, construct a low-dimensional 
basis that can represent the images of an object of a wide variety of materials 
under more or less arbitrary illumination. 

The basis is created using model-based PCA. Rather than performing PCA 
on a set of captured images, the PCA is analytically derived from the image 
formation model. This makes it possible to create a basis for a wide variety of 
conditions using a set of captured illumination maps and a database of surface 
reflectance functions (BRDFs). The illumination maps are subjected to all 3D 
rotations to take into account every possible lighting configuration. 

The BRDF acts as a low-pass filter on the incident illumination, making the 
reflected light band-limited. Hence, by formulating the image formation in 
frequency space we can derive a finite-dimensional model of the image irradiance 
even when the illumination is unknown and arbitrary. As in [2,11] the 
illumination is represented by its spherical harmonics coefficients, L_l^m. The 
BRDF is represented by its coefficients, b_{op}^q, in the basis by Koenderink and 
van Doorn [5], based on the Zernike polynomials. Computing the image irradiance 
using these representations leads to a basis for image irradiance E. At this 
point we approximate the camera projection as orthographic, which makes the image 
irradiance uniquely determined by the surface normal (α, β). This results in the 
representation 

E(\alpha, \beta) = \sum_i c_i E_i(\alpha, \beta)    (1) 

where c_i = L_l^m b_{op}^q (l, m, o, p and q are given by i due to an ordering 
of the basis functions). The E_i are the image irradiance basis functions and are 
products of the Wigner D-functions (for real spherical harmonics) and the Zernike 
polynomials. See [9] for their explicit form. 

The basis can represent the image irradiance from any isotropic surface under 
any illumination. In the general case an infinite number of basis functions is 
needed, but for many materials which act as low-pass filters on the illumination, 
the sum can be truncated and still be an accurate representation. This finite 
representation allows us to analytically derive the principal components. The 
variations in the illumination and surface reflectance properties are described by 
the covariance matrices of their respective coefficients, L_l^m and b_{op}^q. The 
resulting principal components are linear combinations of the basis functions 
E_i. See [9] for details. 

When we use the PCA basis to represent images we assume that the 
illumination is the same for each point in the image. This assumption holds if the 
light source is distant. It is also necessary that there are no cast shadows or local 
inter-reflections, which is true when the object shape is convex. For non-convex 
objects this model is an approximation, where the quality of the 
approximation depends on the concavities and the material/illumination conditions. 
For instance, bright objects will have stronger inter-reflections than dark objects. 

An important property of model-based PCA is that we can relate the 
principal components to the properties of the illumination and surface reflectance. 
There is an explicit relation between the coefficients in the PCA basis and the 
coefficients of the illumination and the BRDF. From the coefficients of 
illumination and BRDF, the coefficients, d, in the PCA space can be computed by a 
simple matrix product 

d = A c    (2) 

where the elements of c are c_i = L_l^m b_{op}^q. The PCA basis as well as the 
matrix A are computed from the shape of the object and the variations in the 
illumination and BRDF. 

Another important aspect of the PCA basis regards robustness. The first 
basis function is selected so that it maximizes the signal variance of the component 
it represents. The subsequent basis functions maximize the same variance while 
being orthogonal to all the previous functions. This means that the first basis 
function has the highest signal-to-noise ratio (SNR), on average, hence being the 
most robust component to estimate. The following components will be less and 
less robust to estimate. In other words, selecting the number of basis functions 
to use is not just a question of saving computer memory and computation time, 
but also a question of robustness and regularization. 



3 Material Recognition 

Our approach to material recognition is to estimate the coefficients in the ba- 
sis described in the previous section from the images and compare them to a 
database of known materials. 

Since the illumination is not known we cannot calculate what the correspond- 
ing coefficients should be for the materials in the database. We need to take into 
account all possible illuminations and find the illumination-material pair that 
best matches the image. For this to be possible it is necessary that the variations 
in the coefficient space are much smaller than the variations in the illumination 
(which are infinite). If this is true we can learn the variations in the coefficient 
space with only a limited amount of training illuminations. 

Smooth variations in the illumination result in a manifold of points in the 
coefficient space. To learn these manifolds we take a set of illumination maps 
and rotate them over all rotations. To store the manifolds we sample them by 
sampling the rotation group SO(3) and calculate the coefficients for each sample 
point for every illumination map and material. 

The image is classified by finding the manifold which is closest to the point 
representing the image. The procedure is very much the same as in [7]. 

3.1 Learning the Manifolds 

The manifold for each material is learnt from a set of illumination maps that are 
rotated over the full rotation group. The rotation group is sampled and for each 
rotation (α, β, γ) the spherical harmonic coefficients of the rotated illumination 
map are calculated. The point on the manifold is given by equation (2). 

To sample the rotation group we sample the surface of a sphere and combine 
it with a circle. The sphere is sampled by starting from an icosahedron inscribed 
in the sphere. The icosahedron is recursively subdivided by projecting the 
midpoint of each edge onto the surface of the sphere, forming four new triangles 
for each old triangle [1]. The circle is sampled at a density as close as possible 
to the sampling of the sphere. 
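
A minimal sketch of this sampling scheme is given below. It assumes that a 
rotation sample is represented as a direction on the sphere paired with an angle 
on the circle; the function names and the choice of circle density are 
hypothetical and not taken from the source. 

```python
import math

def normalise(v):
    """Project a 3D point onto the unit sphere."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def icosahedron():
    """Unit-length vertices and triangular faces of a regular icosahedron."""
    phi = (1.0 + math.sqrt(5.0)) / 2.0
    raw = [(-1, phi, 0), (1, phi, 0), (-1, -phi, 0), (1, -phi, 0),
           (0, -1, phi), (0, 1, phi), (0, -1, -phi), (0, 1, -phi),
           (phi, 0, -1), (phi, 0, 1), (-phi, 0, -1), (-phi, 0, 1)]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    return [normalise(v) for v in raw], faces

def subdivide(vertices, faces):
    """Split every triangle into four, pushing edge midpoints onto the sphere."""
    vertices = list(vertices)
    midpoint_index = {}

    def midpoint(a, b):
        key = (min(a, b), max(a, b))  # each edge is shared by two faces
        if key not in midpoint_index:
            mid = tuple((p + q) / 2.0 for p, q in zip(vertices[a], vertices[b]))
            vertices.append(normalise(mid))
            midpoint_index[key] = len(vertices) - 1
        return midpoint_index[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
    return vertices, new_faces

def sample_rotations(levels, n_circle):
    """Pair every sampled sphere direction with n_circle angles on the circle."""
    vertices, faces = icosahedron()
    for _ in range(levels):
        vertices, faces = subdivide(vertices, faces)
    angles = [2.0 * math.pi * k / n_circle for k in range(n_circle)]
    return [(direction, angle) for direction in vertices for angle in angles]
```

Each subdivision level takes the vertex count from 12 to 42 to 162. With one 
subdivision (42 directions) and 11 circle angles this yields 462 samples, the 
same count as the 462 rotations quoted in Section 4, although the source does 
not state how that number was obtained. 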

3.2 Finding the Closest Manifold 

To find the closest manifold to a point we simply go through all points on each 
manifold and calculate the distance to the point to be classified. The distance 
measure is the sum of squared differences in coefficient space. 

To aid our algorithm in being illumination invariant we take a number of 
steps. The first element of the point is discarded. It corresponds to the constant 
function of the basis and captures the variations in the ambient component of 
the illumination. By discarding it the algorithm becomes independent of such 
variations. 

The remaining elements are normalized to get brightness independence. It 
also means that we will not be able to differentiate between bright and dark 
materials, although this could be added at a later stage by comparing the signal 
variances of the images. 
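
The classification rule of this section can be sketched as a nearest-neighbour 
search over the sampled manifold points. The sketch assumes that "normalized" 
means scaling the remaining coefficients to unit Euclidean length, which the 
text does not specify; the function names and the dictionary layout are 
hypothetical. 

```python
import math

def normalise_point(coefficients):
    """Drop the first (constant, ambient) coefficient and scale the rest to unit length."""
    rest = list(coefficients[1:])
    norm = math.sqrt(sum(c * c for c in rest))
    if norm == 0.0:
        raise ValueError("all non-constant coefficients are zero")
    return [c / norm for c in rest]

def classify(image_coefficients, manifolds):
    """Return (label, distance) of the sampled manifold closest to the image point.

    manifolds maps a material label to a list of sampled coefficient vectors,
    one per combination of illumination map and rotation.
    """
    query = normalise_point(image_coefficients)
    best_label, best_distance = None, math.inf
    for label, samples in manifolds.items():
        for sample in samples:
            point = normalise_point(sample)
            # Sum of squared differences in coefficient space.
            distance = sum((q - p) ** 2 for q, p in zip(query, point))
            if distance < best_distance:
                best_label, best_distance = label, distance
    return best_label, best_distance
```

With this normalisation, scaling the image brightness or changing the ambient 
term leaves the result unchanged, which is the invariance the two steps above 
are meant to provide. 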

4 Discrimination of Materials in the CUReT Database 

Before we move on to real images we need to assess what can be done. How well 
can materials be discriminated from their reflectance properties alone? Figure 1 
demonstrates that many materials look similar to the human eye. 

To test this we will analyze how well the materials in the CUReT database can 
be discriminated in synthetic images, i.e. when there is no noise. The illumination 
is considered to be unknown. The algorithm is tested on images generated from 
one of the illumination maps, while the other illumination maps are used to build 
the manifolds for classification. This is repeated for all nine illumination maps 
(the leave-one-out principle). 

We do not actually need to generate any images. Using the low-dimensional 
basis framework described in Section 2, we can compute the coefficients in the 
low-dimensional basis of the image directly from the illumination and material 
coefficients. This allows for extensive testing. Each of the 48 materials used is 
tested with nine illumination maps, each under 462 different rotations, giving a 
total of about 200 000 test images (48 x 9 x 462 = 199 584). 

Fig. 1. Rendered images using BRDFs from the CUReT database. Classifying 
materials from their reflectance properties can be very hard, as in this case. If 
you disregard the color, many materials look very similar. 

Fig. 2. Sampled manifolds in the coefficient space of materials 1-Felt (blue rings) and 
7-Velvet (red crosses) under one of the illumination maps undergoing all 3D rotations, 
SO(3). 

Figure 3 shows the classification rates for the different materials. The correct 
classification rates, which can be seen in the diagonal, range between 5 and 80 
percent. Materials with a high classification rate are 7-Velvet and 61-Moss, which 
have particular reflectance properties. Glossy materials in general have a higher 
recognition rate than matte materials. 

Fig. 3. Recognition rates for the CUReT materials. Each row shows the classification 
rates for a particular material, e.g. the leftmost element in the first row is the rate 
at which material no. 1 is classified as material no. 1, and the second element is the 
rate at which material no. 1 is classified as material no. 2. The diagonal is the 
correct classification rate. These results are discussed further in the text. 

What is interesting is that the materials seem to form groups, where a 
material in a group is systematically misclassified as one of the other materials 
in that same group. This becomes apparent when we order the materials in a 
particular way. Figure 4 shows the exact same classification rates as Figure 3, 
but with the materials ordered using a hierarchical grouping algorithm that will 
be described in the next section. We begin to distinguish blocks in the diagonal 
of the matrix. There is a large block of matte materials in the top left corner, 
formed by the materials 1-Felt, 20-Styrofoam, ..., 24-Rabbit Fur. Following the 
matte materials is a group of glossy materials, 4-Rough Plastic, ..., 15-Foil. Next 
comes 7-Velvet and a group of velvet-like appearance (asperity scattering), 
13-Art. Grass, 19-Rug_b and 61-Moss. Finally we have 35-Painted Spheres, which 
forms a group of its own. 

[Figure 4: classification-rate matrix; horizontal axis "Classified as". Material 
order on both axes: 1 20 52 28 26 39 2 3 18 43 16 21 48 37 14 22 47 49 53 6 10 59 
50 8 45 11 60 24 4 12 5 55 17 36 9 25 27 23 34 41 15 33 7 13 19 61 35] 

Fig. 4. When the classification rates from Figure 3 are sorted in a particular way a 
pattern emerges. The materials form groups. Materials within a group are often 
classified as one of the other materials in the same group. The largest group can be 
seen as a grey block in the top left corner of the matrix. These are the matte 
materials, 1-Felt, 20-Styrofoam, ..., 24-Rabbit Fur. After the matte materials comes 
a group of more glossy materials, 4-Rough Plastic, ..., 36-Limestone. Next comes a 
group of shiny materials, 9-Frosted Glass to 33-Slate_a. Last is a group of materials 
with asperity type scattering, 7-Velvet, 13-Artificial Grass, 19-Rug_b and 61-Moss. 

4.1 Visual Grouping of the Materials 

It is clear that we cannot expect to distinguish between some of the materials 
in the CUReT database. Instead we can try to find groups in which to classify 
the materials. 

Using the matrix containing the classification rates we group the materials. 
The grouping is done in a greedy fashion. We start with groups of single materi- 
als. Then the two groups that maximize the average recognition rate are joined. 
This is repeated until the desired number of groups is reached. To select the 
number of groups one can e.g. look at the ratio between the recognition rates 
and the rate of selecting the correct material by chance. 
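
The greedy grouping can be sketched as follows. The text does not define the 
average recognition rate for a set of groups, so this sketch assumes it is the 
mean, over all materials, of the rate at which a material is classified as any 
member of its own group. The function names are hypothetical. 

```python
from itertools import combinations

def average_recognition_rate(rates, groups):
    """Mean rate at which a material is classified into its own group.

    rates[i][j] is the rate at which material i is classified as material j.
    """
    total = 0.0
    for group in groups:
        for i in group:
            total += sum(rates[i][j] for j in group)
    return total / len(rates)

def greedy_grouping(rates, n_groups):
    """Start from single-material groups and repeatedly join the best pair."""
    groups = [[i] for i in range(len(rates))]
    while len(groups) > n_groups:
        best_pair, best_rate = None, -1.0
        for a, b in combinations(range(len(groups)), 2):
            candidate = [g for k, g in enumerate(groups) if k not in (a, b)]
            candidate.append(groups[a] + groups[b])
            rate = average_recognition_rate(rates, candidate)
            if rate > best_rate:
                best_pair, best_rate = (a, b), rate
        a, b = best_pair
        merged = groups[a] + groups[b]
        groups = [g for k, g in enumerate(groups) if k not in (a, b)] + [merged]
    return groups
```

Under this definition every merge can only raise the average rate, so the 
stopping criterion has to come from elsewhere, for example the comparison with 
the chance rate mentioned above. 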

Dividing the CUReT database into 9 groups results in the grouping in Fig- 
ure 5. We have labeled the groups according to the characteristics of their mem- 
bers. All matte materials end up in one group. Materials having specular re- 
flectance are split up in three groups. The last five groups are materials that 
did not fit into any group. These materials have a high recognition rate on their 
own. 



Group  Members                 Label 
1      1, 2, 3, 6, 8, 10, ...  Matte 
2      4, 5, 12, 17, 36, 55    Glossy 
3      9, 23, 25, 27, 34, 41   Shiny 
4      15, 33                  Shinier 
5      7                       Velvet 
6      13                      Art. Grass 
7      19                      Rug 
8      61                      Moss 
9      35                      Spheres 

Fig. 5. Classification rates when the materials are grouped into nine groups. Not 
all members are listed in the matte group due to space limitations, but this group 
contains all materials that are not in the other groups. 



More or less all the groups are sometimes mis-classified as matte materials. 
This makes sense. In the testing we take all rotations of the illumination into 
account. This means that sometimes the dominant light source in the scene 
will be behind the object. Hence, there will be no specularity on the object to 
differentiate it from a matte material. 

5 Classifying the Material in Real Images 

To test the algorithm we glued five different real materials onto cylinders, see 
Figure 6. Cylinders were chosen due to the difficulty of gluing non-stretchable 
materials onto a sphere. The cylinders were photographed using a digital 
camera in different illumination conditions, including outdoor sunny, outdoor 
cloudy and indoor with indirect light from a window. Before classification the 
images were radiometrically calibrated, using the method in [6]. The geometry of 
the cylinders was estimated by manually marking where in the image the cylinders 
were. 

Fig. 6. The algorithm was tested on images of cylinders with pieces of five different 
real materials glued onto them. Top row from left to right: felt, velvet 1, velvet 2, 
leather and imitation leather. Bottom row: leather in five of the different illumination 
conditions. 

Using the framework from Section 2 we computed a basis for the cylinder. 
A total of six basis functions were used in the experiments. The coefficients for 
the image were estimated by projecting the image onto the basis. The image 
was then classified by finding the closest manifold as described in Section 3. The 
manifolds were this time learned using all nine illumination maps. 
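
The projection step can be sketched as below, under the assumption that the 
basis functions are orthonormal over the image pixels, so that each coefficient 
is a plain inner product. For a basis that is not orthonormal over the sampled 
pixels, a least-squares fit would be needed instead. The function names are 
hypothetical. 

```python
def project_onto_basis(image, basis):
    """Coefficients of a flattened image in an orthonormal basis.

    image is a list of pixel values; basis is a list of basis functions,
    each sampled at the same pixels as the image.
    """
    return [sum(p * b for p, b in zip(image, function)) for function in basis]

def reconstruct(coefficients, basis):
    """Image represented by the coefficients (what the classifier works from)."""
    n_pixels = len(basis[0])
    return [sum(c * function[i] for c, function in zip(coefficients, basis))
            for i in range(n_pixels)]
```

Comparing the reconstruction with the original image, as in Figure 7, shows how 
much of the image irradiance the chosen number of basis functions captures. 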

Figure 7 shows some of the images being classified. Note how well the basis 
is able to represent the image irradiance in all cases. 

A total of 84 images were used in the experiment. Table 1 summarizes the 
results. As predicted by the synthetic experiments, only a few of the images 
were correctly classified on an individual basis. Felt and the two velvets have a 
recognition rate of 5% to 7.7%, which is still several times greater than chance, 
which is 1/48 ≈ 2.1%. When using the grouping in Figure 5 the recognition 
rates are higher. Felt is to a large extent classified as matte. The leather here 
is classified as Shiny or Shinier, while the leather in the database is categorized 
as Glossy. This could be because our leather is shinier than the leather in the 
database. Visually, at least, it appears so. The imitation leather is also mostly 
classified as Shiny or Shinier. 

So far the results match the synthetic results fairly well. The velvet, however, 
does not. The synthetic results indicate that velvet should be fairly easy to 
recognize, but in our experiments the two velvet cylinders are mostly classified 
as matte. On the other hand, they are also often classified as one of the groups 
Grass, Rug and Moss, which have the same type of surface reflectance as Velvet. 

Fig. 7. Examples of classified images: (a)-(d) images for Felt: (a) calibrated gray 
image, (b) reconstructed gray image (this is what the algorithm “sees”), (c) image 
and reconstructed intensity profiles, (d) distances to the ten closest materials. Here 
the material is correctly classified as felt. (e)-(h) show the same images for Velvet 1. 
The material is here incorrectly classified as 24-Rabbit Fur; 7-Velvet comes in third 
place. (i)-(l) images for leather, which in this case is classified as 41-Brick_b; 
5-Leather is the third closest material. (m)-(p) imitation leather: classified as 
55-Orange, with 5-Leather in seventh place. Notice how well the basis represents the 
irradiance in the different cases. 



Table 1. Classification Rates (%) for the Cylinder Images 

Material     Correct  Matte  Glossy  Shiny  Shinier  Velvet  Grass  Rug  Moss  Spheres 
Felt         7.7      77     7.7     0      15       0       0      0    0     0 
Leather      0        25     6.2     44     19       6.2     0      0    0     0 
Im. Leather  0        10     0       40     35       10      0      0    0     5 
Velvet 1     5        55     0       5      5        5       10     10   10    0 
Velvet 2     6.7      40     0       33     0        6.7     0      13   6.7   0 

6 Conclusions 

We have investigated the problem of classifying the surface material from a single 
image with unknown illumination, given the surface shape. 

Recognizing materials from their reflectance properties is hard. We cannot 
expect to distinguish between many of the materials in the CUReT database. 
Instead we should find groups in which to classify the materials. 

The grouping produced by our algorithm suggests that we can expect to dis- 
tinguish between matte materials, special materials such as velvet and materials 
of different grades of shininess. 



Acknowledgments. This work was done within the EU-IST project 
IST-2000-29688 Insight2+. The support is gratefully acknowledged. 

References 

1. D.H. Ballard and C.M. Brown. Computer Vision. Prentice-Hall, 1982. 

2. R. Basri and D. Jacobs. Lambertian reflectance and linear subspaces. IEEE Trans. 
Pattern Analysis and Machine Intelligence, 25(2):218-233, February 2003. 

3. K.J. Dana, B. van Ginneken, S.K. Nayar, and J.J. Koenderink. Reflectance and 
texture of real-world surfaces. ACM Transactions on Graphics, 18(1):1-34, January 
1999. 

4. R.O. Dror, E.H. Adelson, and A.S. Willsky. Recognition of surface reflectance 
properties from a single image under unknown real-world illumination. In Workshop of 
recognizing objects under varying illumination, 2001. 

5. J.J. Koenderink and A.J. van Doorn. Phenomenological description of bidirectional 
surface reflection. J. Optical Soc. of Am. A, 15(11):2903-2912, November 1998. 

6. T. Mitsunaga and S.K. Nayar. Radiometric self calibration. In Proc. Computer 
Vision and Pattern Recognition, pages I: 374-380, 1999. 

7. H. Murase and S.K. Nayar. Visual learning and recognition of 3-d objects from 
appearance. Int. Journal of Computer Vision, 14(1):5-24, January 1995. 

8. P. Nillius and J.O. Eklundh. Low-dimensional representations of shaded surfaces 
under varying illumination. In Proc. Computer Vision and Pattern Recognition, 
pages II:185-192, 2003. 

9. P. Nillius and J.O. Eklundh. Phenomenological eigenfunctions for image irradiance. 
In International Conference on Computer Vision, pages 568-575, 2003. 

10. R. Ramamoorthi. Analytic PCA construction for theoretical analysis of lighting 
variability in images of a Lambertian object. IEEE Trans. Pattern Analysis and 
Machine Intelligence, 24(10):1322-1333, October 2002. 

11. R. Ramamoorthi and P. Hanrahan. On the relationship between radiance and 
irradiance: determining the illumination from images of a convex Lambertian object. 
J. Optical Soc. of Am. A, 18(10):2448-2458, October 2001. 

12. R. Ramamoorthi and P. Hanrahan. A signal-processing framework for inverse 
rendering. In SIGGRAPH, 2001. 




Seamless Image Stitching in the Gradient Domain* 



Anat Levin, Assaf Zomet**, Shmuel Peleg, and Yair Weiss 

School of Computer Science and Engineering 
The Hebrew University of Jerusalem 
91904, Jerusalem, Israel 

{alevin, peleg, yweiss}@cs.huji.ac.il, zomet@cs.columbia.edu 



Abstract. Image stitching is used to combine several individual im- 
ages having some overlap into a composite image. The quality of image 
stitching is measured by the similarity of the stitched image to each of 
the input images, and by the visibility of the seam between the stitched 
images. 

In order to define and get the best possible stitching, we introduce several 
formal cost functions for the evaluation of the quality of stitching. In 
these cost functions, the similarity to the input images and the visibility 
of the seam are defined in the gradient domain, minimizing the disturbing 
edges along the seam. A good image stitching will optimize these cost 
functions, overcoming both photometric inconsistencies and geometric 
misalignments between the stitched images. 

This approach is demonstrated in the generation of panoramic images 
and in object blending. Comparisons with existing methods show the 
benefits of optimizing the measures in the gradient domain. 



1 Introduction 

Image stitching is a common practice in the generation of panoramic images and 
applications such as object insertion, super resolution [1] and texture synthesis 
[2]. An example of image stitching is shown in Figure 1. Two images $I_1$, $I_2$ capture 
different portions of the same scene, with an overlap region viewed in both 
images. The images should be stitched to generate a mosaic image $I$. A simple 
pasting of a left region from $I_1$ and a right region from $I_2$ produces visible 
artificial edges in the seam between the images, due to differences in camera 
gain, scene illumination or geometrical misalignments. 

The aim of a stitching algorithm is to produce a visually plausible mosaic 
with two desirable properties: First, the mosaic should be as similar as possible 
to the input images, both geometrically and photometrically. Second, the seam 
between the stitched images should be invisible. While these requirements are 
widely acceptable for visual examination of a stitching result, their definition as 
quality criteria was either limited or implicit in previous approaches. 

** Current Address: Computer Science Department, Columbia University, 500 West 

[Figure 1 panels: input image $I_1$; input image $I_2$; pasting of $I_1$ and $I_2$; stitching result] 

Fig. 1. Image stitching. On the left are the input images; $\omega$ is the overlap region. On 
top right is a simple pasting of the input images. On the bottom right is the result of 
the GIST1 algorithm. 

In this work we present several cost functions for these requirements, and 
define the mosaic image as their optimum. The stitching quality in the seam 
region is measured in the gradient domain. The mosaic image should contain a 
minimal amount of seam artifacts, i.e. a seam should not introduce a new edge 
that does not appear in either $I_1$ or $I_2$. As image dissimilarity, the gradients of 
the mosaic image $I$ are compared with the gradients of $I_1$, $I_2$. This reduces the 
effects caused by global inconsistencies between the stitched images. We call our 
framework GIST: Gradient-domain Image STitching. 

We demonstrate this approach in panoramic mosaicing and object blending. 
Analytical and experimental comparisons of our approach to existing methods 
show the benefits in working in the gradient domain, and in directly minimizing 
gradient artifacts. 



1.1 Related Work 

There are two main approaches to image stitching in the literature, assuming that 
the images have already been aligned. Optimal seam algorithms [3,2,4] search for 
a curve in the overlap region on which the differences between $I_1$, $I_2$ are minimal. 
Then each image is copied to the corresponding side of the seam. In case the 
difference between $I_1$, $I_2$ on the curve is zero, no seam gradients are produced in 
the mosaic image $I$. However, the seam is visible when there is no such curve, 
for example when there is a global intensity difference between the images. This 
is illustrated on the first row of Figure 2. In addition, optimal seam methods are 
less appropriate when thin strips are taken from the input images, as in the case 
of manifold mosaicing [5]. 
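The optimal seam idea can be illustrated with a minimal sketch. This is not the specific algorithm of [3,2,4]; it is a plain dynamic program that finds a seam $x = f(y)$ through the overlap region minimizing the sum of absolute differences between the two images, and then pastes each image on its side of the seam. The function names and the constraint $|f(y+1) - f(y)| \le 1$ are assumptions made for this illustration.

```python
def optimal_seam(img1, img2):
    """Return seam, where seam[y] is the seam column in row y.

    img1, img2: equally sized 2D lists (rows of intensities) covering the
    overlap region. The seam minimizes sum_y |img1[y][x] - img2[y][x]| along
    the curve, subject to |seam[y+1] - seam[y]| <= 1 (an assumed constraint).
    """
    h, w = len(img1), len(img1[0])
    diff = [[abs(img1[y][x] - img2[y][x]) for x in range(w)] for y in range(h)]
    cost = [diff[0][:]]   # cost[y][x]: cheapest seam ending at column x, row y
    back = [[0] * w]      # back[y][x]: column chosen in row y - 1
    for y in range(1, h):
        row, brow = [], []
        for x in range(w):
            candidates = range(max(0, x - 1), min(w, x + 2))
            best = min(candidates, key=lambda c: cost[y - 1][c])
            row.append(diff[y][x] + cost[y - 1][best])
            brow.append(best)
        cost.append(row)
        back.append(brow)
    # Backtrack from the cheapest end point in the last row.
    x = min(range(w), key=lambda c: cost[h - 1][c])
    seam = [0] * h
    for y in range(h - 1, -1, -1):
        seam[y] = x
        x = back[y][x]
    return seam


def paste_along_seam(img1, img2, seam):
    """Copy img1 left of the seam and img2 from the seam column onwards."""
    return [[img1[y][x] if x < seam[y] else img2[y][x]
             for x in range(len(img1[0]))] for y in range(len(img1))]
```

When the two images agree exactly along some curve, the seam found this way has zero cost and the pasted result shows no new edge; when they differ everywhere, for instance by a global intensity offset, every seam has positive cost and the paste remains visible.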

The second approach minimizes seam artifacts by smoothing the transition 
between the images. In Feathering [6] or alpha blending, the mosaic image $I$ 
is a weighted combination of the input images $I_1$, $I_2$. The weighting coefficients 
(alpha mask) vary as a function of the distance from the seam. In pyramid 
blending [7], different frequency bands are combined with different alpha masks. 
Lower frequencies are mixed over a wide region, and fine details are mixed in a 
narrow region. This produces gradual transition in lower frequencies, while 
reducing edge duplications in textured regions. A related approach was suggested 
in [8], where a smooth function was added to the input images to force a 
consistency between the images in the seam curve. In case there are misalignments 
between the images [6], these methods leave artifacts in the mosaic such as double 
edges, as shown in Figure 2. 
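As a rough illustration of Feathering, the sketch below blends two aligned 1D signals with an alpha mask that falls linearly across the overlap. The function name, the 1D setting and the exact linear ramp are assumptions for illustration only; the methods in [6,7] operate on 2D images and may use other masks.

```python
def feather_blend(row1, row2, start, end):
    """Alpha-blend two aligned 1D signals over the overlap [start, end).

    Left of the overlap the output equals row1, right of it row2. Inside,
    the weight of row1 falls linearly from 1 towards 0 (assumed linear mask).
    """
    out = []
    for x in range(len(row1)):
        if x < start:
            w = 1.0
        elif x >= end:
            w = 0.0
        else:
            w = 1.0 - (x - start + 1) / (end - start + 1)
        out.append(w * row1[x] + (1.0 - w) * row2[x])
    return out
```

With a global intensity difference between the inputs, the blend turns the step at the seam into a gradual ramp. With misaligned edges inside the overlap, both copies of the edge contribute to the output, which is the double-edge artifact mentioned above.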

In our approach we compute the mosaic image I by an optimization process 
that uses image gradients. Computation in the gradient domain was recently used 
in compression of dynamic range [9], image editing [10], image inpainting [11] and 
separation of images to layers [12,13,14,15]. The closest work to ours was done 
by Perez et al. [10], who suggest editing images by manipulating their gradients. 
One application is object insertion, where an object is cut from an image, and 
inserted into a new background image. The insertion is done by optimizing over 
the derivatives of the inserted object, with the boundary determined by the 
background image. In Sections 4 and 5 we compare our approach to [10]. 



2 GIST: Image Stitching in the Gradient Domain 

We describe two approaches to image stitching in the gradient domain. Section 
2.1 describes GIST1, where the mosaic image is inferred directly from the 
derivatives of the input images. Section 2.2 describes GIST2, a two-step approach 
to image stitching. Section 2.3 compares the two approaches to each 
other, and with other methods. 



2.1 GIST1: Optimizing a Cost Function over Image Derivatives 

The first approach, GIST1, computes the stitched image by minimizing a cost 
function $E_p$. $E_p$ is a dissimilarity measure between the derivatives of the stitched 
image and the derivatives of the input images. 

Specifically, let $I_1$, $I_2$ be two aligned input images. Let $\tau_1$ ($\tau_2$ resp.) be the region 
viewed exclusively in image $I_1$ ($I_2$ resp.), and let $\omega$ be the overlap region, as 
shown in Figure 1, with $\tau_1 \cap \tau_2 = \tau_1 \cap \omega = \tau_2 \cap \omega = \emptyset$. Let $W$ be a weighting 
mask image. 

Fig. 2. Comparing stitching methods with various sources for inconsistencies between 
the input images. The left side of $I_1$ is stitched to the right side of $I_2$. Optimal seam 
methods produce a seam artifact in case of photometric inconsistencies between the 
images (first row). Feathering and pyramid blending produce double edges in case of 
horizontal misalignments (second row). In case of vertical misalignments (third 
row), the stitching is less visible with Feathering and GIST. 



The stitching result of GIST1 is defined as the minimum of $E_p$ with respect 
to $I$: 

$$E_p(I; I_1, I_2, W) = d_p(\nabla I, \nabla I_1, \tau_1 \cup \omega, W) + d_p(\nabla I, \nabla I_2, \tau_2 \cup \omega, U - W) \tag{1}$$

where $U$ is a uniform image, and $d_p(J_1, J_2, \phi, W)$ is the distance between $J_1$, $J_2$ 
on $\phi$: 

$$d_p(J_1, J_2, \phi, W) = \sum_{q \in \phi} W(q) \, \| J_1(q) - J_2(q) \|_p^p \tag{2}$$

with $\| \cdot \|_p$ denoting the $\ell_p$-norm. 

The dissimilarity $E_p$ between the images is defined by the distance between 
their derivatives. A dissimilarity in the gradient domain is invariant to the mean 
intensity of the image. In addition it is less sensitive to smooth global differences 
between the input images, e.g. due to non-uniformity in the camera photometric 
response and due to scene shading variations. On the overlap region $\omega$, the cost 
function $E_p$ penalizes derivatives which are inconsistent with any of the 
input images. In image locations where both $I_1$ and $I_2$ have low gradients, $E_p$ 
penalizes high gradient values in the mosaic image. This property is useful 
in eliminating false stitching edges. 
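The invariance to the mean intensity can be checked numerically with a small sketch of a weighted gradient distance in the spirit of Equations 1 and 2. Several choices here are assumptions for illustration: forward differences set to zero at the image border, the distance taken as the weighted sum of the p-th power of the $\ell_p$-norm of the gradient difference, a uniform image $U$ of ones, and all function names.

```python
def gradient(img):
    """Forward-difference gradient (dx, dy) at every pixel; 0 at the border."""
    h, w = len(img), len(img[0])
    return [[(img[y][x + 1] - img[y][x] if x + 1 < w else 0.0,
              img[y + 1][x] - img[y][x] if y + 1 < h else 0.0)
             for x in range(w)] for y in range(h)]


def d_p(g1, g2, region, weight, p):
    """Sum over pixels q in region of weight(q) * ||g1(q) - g2(q)||_p^p."""
    total = 0.0
    for (x, y) in region:
        ax, ay = g1[y][x]
        bx, by = g2[y][x]
        total += weight[y][x] * (abs(ax - bx) ** p + abs(ay - by) ** p)
    return total


def e_p(mosaic, img1, img2, region1, region2, weight, p):
    """Cost of a candidate mosaic: img1 weighted by W, img2 by (1 - W)."""
    g, g1, g2 = gradient(mosaic), gradient(img1), gradient(img2)
    inverse = [[1.0 - w for w in row] for row in weight]
    return d_p(g, g1, region1, weight, p) + d_p(g, g2, region2, inverse, p)
```

For two inputs that differ only by a constant offset, either input has zero cost as a mosaic, while a direct paste of the left half of one and the right half of the other is penalized for the new edge at the seam.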

The choice of norm (parameter $p$) has implications on both the optimization 
algorithm and the mosaic image. The minimization of $E_p$ (Equation 1) for $p \ge 1$ 
is convex, and hence efficient optimization algorithms can be used. Section 3 
describes a minimization scheme for $E_2$ by existing algorithms, and a novel fast 
minimization scheme for $E_1$. The mask image $W$ was either a uniform mask (for 
$E_1$) or the Feathering mask (for $E_2$), which is linear with the signed distance 
from the seam. The influence of the choice of $p$ on the result image is addressed in 
the following sections, with the introduction of alternative stitching algorithms 
in the gradient domain. 

[Figure 3 panels: Optimal seam; Optimal seam on the gradients; Pyramid blending; 
Pyramid blending on the gradients; Feathering; GIST1] 

Fig. 3. Stitching in the gradient domain. The input images appear in Figure 1, with 
the overlap region marked by a black rectangle. With the image domain methods 
(top panels) the stitching is observable. Gradient-domain methods (bottom panels) 
overcome global inconsistencies. 



2.2 GIST2: Stitching Derivative Images 

A simpler approach is to stitch the derivatives of the input images: 

1. Compute the derivatives of the input images 

2. Stitch the derivative images to form a field $F = (F_x, F_y)$. $F_x$ is obtained by 
stitching $\partial I_1 / \partial x$ and $\partial I_2 / \partial x$, and $F_y$ is obtained by stitching 
$\partial I_1 / \partial y$ and $\partial I_2 / \partial y$. 

3. Find the mosaic image whose gradients are closest to $F$. This is equivalent 
to minimizing $d_p(\nabla I, F, \pi, U)$ where $\pi$ is the entire image area and $U$ is a 
uniform image. 
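The three stages can be sketched in one dimension, where the last stage is trivial because any 1D derivative field can be integrated exactly by a running sum. This is only an illustrative analogue under stated assumptions: a hard seam is used to stitch the derivatives, the free constant is taken from the first sample of the first signal, and the function name is hypothetical. In 2D the stitched field is generally not integrable, so stage 3 requires the optimization described in Section 3.

```python
def stitch_gradients_1d(s1, s2, seam):
    """1D analogue of stitching derivative signals with a hard seam."""
    # Stage 1: forward differences of both inputs.
    d1 = [s1[i + 1] - s1[i] for i in range(len(s1) - 1)]
    d2 = [s2[i + 1] - s2[i] for i in range(len(s2) - 1)]
    # Stage 2: stitch the derivative signals at the seam.
    field = d1[:seam] + d2[seam:]
    # Stage 3: integrate; the constant offset is taken from s1 (an assumption).
    out = [s1[0]]
    for d in field:
        out.append(out[-1] + d)
    return out
```

For two signals that differ by a constant gain offset, pasting the intensities leaves a jump at the seam, while stitching the derivatives and integrating gives a seamless result.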

In stage (2) above, any stitching algorithm may be used. We have experimented 
with Feathering, pyramid blending [7], and optimal seam. For the optimal seam 
we used the algorithm in [2], finding the curve x = f(y) that minimizes the 
sum of absolute differences in the input images. Stage (3), the optimization 
under $\ell_1$, $\ell_2$, is described in Section 3. Unlike the GIST1 algorithm described 
in the previous section, we found minor differences in the result images when 
minimizing $d_p$ under $\ell_1$ and $\ell_2$. 

2.3 Which Method to Use? 

In the previous sections we presented several stitching methods. Since stitching 
results are tested visually, selecting the most appropriate method may be subject 
to personal taste. However, a formal analysis of properties of these methods is 
provided below. Based on those properties in conjunction with the experiments 
in Section 4, we recommend using GIST1 under $\ell_1$. 

Theorem 1. Let $I_1$, $I_2$ be two input images for a stitching algorithm, and assume 
there is a curve $x = f(y)$, such that for each $q \in \{(f(y), y)\}$, $I_1(q) = I_2(q)$. 
Let $U$ be a uniform image. Then the optimal seam solution $I$, defined below, is 
a global minimum of $E_p(I; I_1, I_2, U)$ defined in Eq. 1, for any $0 < p \le 1$. 

$$I(x,y) = \begin{cases} I_1(x,y) & x < f(y) \\ I_2(x,y) & x \ge f(y) \end{cases}$$

The reader is referred to [16] for a proof. The theorem implies that GIST1 under 
$\ell_1$ is as good as the optimal seam methods when a perfect seam exists. Hence 
GIST1 under $\ell_1$ can overcome geometric misalignments similarly to 
the optimal seam methods. The advantage of GIST1 over optimal seam methods 
appears when there is no perfect seam, for example due to photometric inconsistencies 
between the input images. This was validated in the experiments. 

We also show an equivalence between GIST1 under $\ell_2$ and Feathering of 
derivatives (GIST2) under $\ell_2$ (note that Feathering the derivatives is different from 
Feathering the images). 

Theorem 2. Let $I_1$, $I_2$ be two input images for a stitching algorithm, and let 
$W$ be a Feathering mask. Let $\omega$, the overlap region of $I_1$, $I_2$, be the entire image 
(without loss of generality, as $W(q) = 1$ for $q \in \tau_1$, and $W(q) = 0$ for $q \in \tau_2$). Let 
$I_{GIST}$ be the minimum of $E_2(I; I_1, I_2, W)$ defined in Eq. 1. Let $F$ be the following 
field: 

$$F = W(q) \nabla I_1(q) + (1 - W(q)) \nabla I_2(q)$$

Then $I_{GIST}$ is the image with the closest gradient field to $F$ under $\ell_2$. 

The proof can be found in [16] as well. This provides insight into the difference 
between GIST1 under $\ell_1$ and under $\ell_2$: under $\ell_2$, the algorithm tends to mix 
the derivatives and hence blur the texture in the overlap region. Under $\ell_1$, the 
algorithm tends to behave similarly to the optimal seam methods, while reducing 
photometric inconsistencies. 
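A per-pixel computation suggests why the $\ell_2$ case mixes the derivatives. This is offered as a sanity check under the assumption $0 \le W(q) \le 1$, not as the proof given in [16]. Writing $g = \nabla I(q)$, $a = \nabla I_1(q)$, $b = \nabla I_2(q)$ and $w = W(q)$ (shorthand introduced here), completing the square gives

$$w \, \|g - a\|_2^2 + (1 - w) \, \|g - b\|_2^2 = \big\| g - \big(w a + (1 - w) b\big) \big\|_2^2 + w (1 - w) \, \|a - b\|_2^2 .$$

The last term does not depend on $g$, so at every pixel the weighted $\ell_2$ cost differs from the squared distance to the blended gradient $w a + (1 - w) b$ only by a constant. Where the two inputs disagree, the preferred gradient is therefore a weighted average of the two, which is consistent with the blurring of texture described above.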



3 Implementation Details 

We have implemented a minimization for Equation 1 under $\ell_1$ and under $\ell_2$. 

Equation 1 defines a set of linear equations in the image intensities, with the 
derivative filters as the coefficients. Similarly to [12,13], we found that good results 
are obtained when the derivatives are approximated by forward-differencing 
filters $\frac{1}{2}[1\ {-1}]$. In the $\ell_1$ case, the results were further enhanced by incorporating 
additional equations using derivative filters in multiple scales. In our experiments 
we added the filter corresponding to forward-differencing in the 2nd 
level of a Gaussian pyramid (obtained by convolving the filter $[1\ 0\ {-1}]$ with a 
vertical and a horizontal Gaussian filter $[1\ 2\ 1]$). Color images were handled 
by applying the algorithm to each of the color channels separately. 

The minimum of Equation 1 under $\ell_2$ with mask $W$ is shown in [16] to be 
the image with the closest derivatives under $\ell_2$ to $F$, the weighted combination 
of the derivatives of the input images: 

$$F(q) = \begin{cases} \nabla I_1(q) & q \in \tau_1 \\ W(q) \nabla I_1(q) + (1 - W(q)) \nabla I_2(q) & q \in \omega \\ \nabla I_2(q) & q \in \tau_2 \end{cases}$$

The solution can be obtained by various methods, e.g. de-convolution [12], FFT 
[17] or multigrid solvers [18]. The results presented in this paper were obtained 
by FFT. 

As for the $\ell_1$ optimization, we found using a uniform mask $U$ to be sufficient. 
Solving the linear equations under $\ell_1$ can be done by linear programming [19]: 

$$\min \sum_i (z_i^+ + z_i^-) \quad \text{subject to} \quad Ax + (z^+ - z^-) = b, \quad x \ge 0, \quad z^+ \ge 0, \quad z^- \ge 0$$

The entries in matrix $A$ are defined by the coefficients of the derivative filters, 
and the vector $b$ contains the derivatives of $I_1$, $I_2$. $x$ is a vectorization of the 
result image. 
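The reason such a linear program yields an $\ell_1$ minimizer is a standard reformulation, sketched here for completeness. For a fixed $x$, let $r = b - Ax$ be the residual (a symbol introduced for this note). The equality constraint forces $z^+ - z^- = r$, and for each component

$$\min_{z_i^+ \ge 0,\ z_i^- \ge 0,\ z_i^+ - z_i^- = r_i} \big( z_i^+ + z_i^- \big) = |r_i| ,$$

attained at $z_i^+ = \max(r_i, 0)$ and $z_i^- = \max(-r_i, 0)$. Summing over $i$, the objective equals $\| Ax - b \|_1$, so minimizing over $x$, $z^+$ and $z^-$ jointly minimizes the $\ell_1$ norm of the residual of the linear equations.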

The linear program was solved using LOQO [20]. A typical execution time for 
a 200 × 300 image on a Pentium 4 was around 2 minutes. Since no boundary 
conditions were used, the solution was determined up to a uniform intensity shift. 
This shift can be determined in various ways. We chose to set it according to the 
median of the values of the input image $I_1$ and the median of the corresponding 
region in the mosaic image. 
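One way to fix the free intensity shift by medians might look as follows. This is a minimal sketch: the function name and the choice to add the difference of the two medians to every mosaic pixel are assumptions, since the text only states that the two medians are used.

```python
from statistics import median


def align_by_median(mosaic, reference, region):
    """Shift mosaic so that its median over region matches the reference.

    mosaic, reference: 2D lists of intensities; region: iterable of (x, y)
    pixels where the reference image is defined.
    """
    region = list(region)
    shift = (median(reference[y][x] for x, y in region)
             - median(mosaic[y][x] for x, y in region))
    return [[value + shift for value in row] for row in mosaic]
```

Using the median rather than the mean makes the estimated shift less sensitive to the pixels where the mosaic legitimately departs from the input image.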



3.1 Iterative $\ell_1$ Optimization 

A faster $\ell_1$ optimization can be achieved by an iterative algorithm in the image 
domain. One way to perform this optimization is described in the following. Due 
to space limitation, we describe the algorithm when the forward differencing 
derivatives are used with kernel $\frac{1}{2}[1\ {-1}]$. The generalization to other filters 
and a parallel implementation appear in [16]. Let $D^x_j$, $D^y_j$ be the forward-differences 
of input image $I_j$. The optimization is performed as follows: 



– Initialize the solution image $I$ 

– Iterate until convergence: 

  • for all $x, y$ in the image, update $I(x,y)$ to be: 

$$\mathrm{median}\Big( \bigcup_j \big\{ I(x+1,y) - D^x_j(x,y),\ I(x-1,y) + D^x_j(x-1,y),\ I(x,y+1) - D^y_j(x,y),\ I(x,y-1) + D^y_j(x,y-1) \big\} \Big) \tag{3}$$

For an even number of samples, the median is taken to be the average of the 
two middle samples. In regions $\tau_j$ where a single image $I_j$ is used, the median is 
taken on the predictions of $I(x,y)$ given its four neighbours and the derivatives 
of image $I_j$. For example, when the derivatives of image $I_j$ are 0, the algorithm 
performs an iterated median filter of the neighbouring pixels. In the overlap 
region $\omega$ of $I_1$, $I_2$, the median is taken over the predictions from both images. 

At every iteration, the algorithm performs a coordinate descent and improves 
the cost function until convergence. As the cost function is bounded by zero, the 
algorithm always converges. However, although the cost function is convex, the 
algorithm does not always converge to the global optimum¹. To improve the 
algorithm convergence and speed, we combined it in a multi-resolution scheme 
using multigrid [18]. In extensive experiments with the multi-resolution extension 
the algorithm always converged to the global optimum. 
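The median update can be sketched as follows. This is a simplified single-resolution illustration rather than the implementation described in the text: it assumes plain forward differences, that every input image covers the whole mosaic, and in-place (Gauss-Seidel style) sweeps; the function names are hypothetical. `statistics.median` averages the two middle samples for an even count, matching the convention stated above.

```python
from statistics import median


def forward_diffs(img):
    """Horizontal and vertical forward differences of a 2D list."""
    h, w = len(img), len(img[0])
    dx = [[img[y][x + 1] - img[y][x] for x in range(w - 1)] for y in range(h)]
    dy = [[img[y + 1][x] - img[y][x] for x in range(w)] for y in range(h - 1)]
    return dx, dy


def median_sweep(sol, diffs):
    """One in-place sweep of the median update; returns True if sol changed.

    diffs is a list of (dx, dy) pairs, one per input image.
    """
    h, w = len(sol), len(sol[0])
    changed = False
    for y in range(h):
        for x in range(w):
            preds = []  # predictions of sol[y][x] from its neighbours
            for dx, dy in diffs:
                if x + 1 < w:
                    preds.append(sol[y][x + 1] - dx[y][x])
                if x > 0:
                    preds.append(sol[y][x - 1] + dx[y][x - 1])
                if y + 1 < h:
                    preds.append(sol[y + 1][x] - dy[y][x])
                if y > 0:
                    preds.append(sol[y - 1][x] + dy[y - 1][x])
            new = median(preds)
            if new != sol[y][x]:
                sol[y][x] = new
                changed = True
    return changed


def l1_cost(sol, diffs):
    """Sum of absolute differences between sol's derivatives and the targets."""
    sdx, sdy = forward_diffs(sol)
    total = 0
    for dx, dy in diffs:
        total += sum(abs(a - b) for ra, rb in zip(sdx, dx) for a, b in zip(ra, rb))
        total += sum(abs(a - b) for ra, rb in zip(sdy, dy) for a, b in zip(ra, rb))
    return total
```

Because the median minimizes a sum of absolute deviations, each pixel update cannot increase the cost. The sketch also reproduces the stationary-point behaviour: for a 4 × 4 image whose left half is 1 and right half is 0, a uniform starting image is left unchanged by a sweep even though its cost is larger than that of the true image.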

4 Experiments 

We have implemented various versions of GIST and applied them to panoramic 
mosaicing and object blending. 

First, we compared GIST to existing image stitching techniques, which work 
on the image intensity domain: Feathering [6], Pyramid Blending [7], and 'optimal 
seam' (implemented as in [2]). The experiments (Figure 3) validated the 
advantage of working in the gradient domain for overcoming photometric inconsistencies. 
Second, we compared the results of GIST1 (Section 2.1), GIST2 (Section 2.2) 
and the method by Perez et al. [10]. Results of these comparisons are shown, 
for example, in Figures 4 and 5, and analyzed in the following sections. 

4.1 Stitching Panoramic Views 

The natural application for image stitching is the construction of panoramic 
pictures from multiple input pictures. Geometrical misalignments between input 
images are caused by lens distortions, by the presence of moving objects, and 
by motion parallax. Photometric inconsistencies between input images may be 
caused by a varying gain, by lens vignetting, by illumination changes, etc. 

The input images for our experiments were captured from different camera 
positions, and were aligned by a 2D parametric transformation. The aligned 
images contained local misalignments due to parallax, and photometric 
inconsistencies due to differences in illumination and in camera gain. Mosaicing 
results are shown in Figures 3, 4 and 5. Figure 3 compares gradient methods vs. image 
domain methods. Figures 4 and 5 demonstrate the performance of the stitching 
algorithms when the input images are misaligned. In all our experiments GIST1 
under $\ell_1$ gave the best results, in some cases comparable with other methods: in 
Figure 4 comparable with Feathering, and in Figure 5 comparable with 'optimal seam'. 
Whenever the input images were misaligned along the seam, GIST1 under 
$\ell_1$ was superior to [10]. 

¹ Consider an image whose left part is white and the right part is black. When applying 
the algorithm on the derivatives of this image, the uniform image is a stationary 
point. 

Fig. 4. Comparing various stitching methods. On top are the input image and the 
result of GIST1 under $\ell_1$. The images on bottom are cropped results of various methods: 
(a) Optimal seam, (b) Feathering, (c) Pyramid blending, (d) Optimal seam on 
the gradients, (e) Feathering on the gradients, (f) Pyramid blending on the gradients, 
(g) Poisson editing [10] and (h) GIST1. The seam is visible in (a), (c), (d), (g). 

4.2 Stitching Object Parts 

Here we combined images of objects of the same class having different appearances. 
Object parts from different images were combined to generate the final 
image. This can be used, for example, by the police, in the construction of a 
suspect's composite portrait from parts of faces in the database. Figure 6 shows 
an example for this application, where GIST1 is compared to pyramid blending 
in the gradient domain. Another example for combination of parts is shown in 
Figure 7. 

Fig. 5. A comparison between various image stitching methods. On top are the input 
image and the result of GIST1 under $\ell_1$. The images on bottom are cropped from 
the results of various methods: (a) Optimal seam, (b) Feathering, (c) Pyramid blending, 
(d) Optimal seam on the gradients, (e) Feathering on the gradients, (f) Pyramid 
blending on the gradients, (g) Poisson editing [10] and (h) GIST1. When there are 
large misalignments, optimal seam and GIST1 produce fewer artifacts. 

5 Discussion 

A novel approach to image stitching was presented, with two main components: 
First, images are combined in the gradient domain rather than in the intensity 
domain. This reduces global inconsistencies between the stitched parts due to 
illumination changes and changes in the camera photometric response. Second, 
the mosaic image is inferred by optimization over image gradients, thus reducing 
seam artifacts and edge duplications. Experiments comparing gradient domain 
stitching algorithms and existing image domain stitching show the benefit of 
stitching in the gradient domain. Even though each stitching algorithm works 
better for some images and worse for others, we found that GIST1 under $\ell_1$ 
always worked well and we recommend it as the standard stitching algorithm. 
The use of the $\ell_1$ norm was especially valuable in overcoming geometrical 
misalignments of the input images. 

Fig. 6. A police application for generating composite portraits. The top panel shows 
the image parts used in the composition, taken from the Yale database. The bottom 
panel shows, from left to right, the results of pasting the original parts, GIST1 under 
$\ell_2$, GIST1 under $\ell_1$ and pyramid blending in the gradient domain. Note the discontinuities 
in the eyebrows. 

Fig. 7. A combination of images of George W. Bush taken at different ages. On top 
are the input images and the combination pattern. On the bottom left are, from left 
to right, the results of GIST1 stitching under $\ell_1$ (a) and under $\ell_2$ (b), the results 
of pyramid blending in the gradient domain (c), and pyramid blending in the image 
domain (d). 

The closest approach to ours was presented recently by Perez et al. [10] for 
image editing. There are two main differences with this work: First, in this work 
we use the gradients of both images in the overlap region, while Perez et al. 
use the gradients of the inserted object and the intensities of the background 
image. Second, the optimization is done under different norms, while Perez et 
al. use the $\ell_2$ norm. Both differences considerably influence the results, especially 
in misaligned textured regions. This is shown in Figures 4 and 5. 

Image stitching was presented as a search for an optimal solution to an image 
quality criterion. The optimization of this criterion under norms $\ell_1$, $\ell_2$ is 
convex, having a single solution. Encouraged by the results obtained by this approach, 
we believe that it will be interesting to explore alternative criteria for 
image quality. One direction can use results on statistics of filter responses in 
natural images [21,22,23]. Another direction is to incorporate additional image 
features in the quality criterion, such as local curvature. Successful results in image 
inpainting [11,24] were obtained when image curvature was used in addition 
to image derivatives. 



Acknowledgments. The authors would like to thank Dhruv Mahajan and 

Abstract. In this paper, we propose a robust motion segmentation 
method using the techniques of matrix factorization and subspace separation. 
We first show that the shape interaction matrix can be derived using 
QR decomposition rather than Singular Value Decomposition (SVD), 
which also leads to a simple proof of the shape subspace separation theorem. 
Using the shape interaction matrix, we solve the motion segmentation 
problems by the spectral clustering techniques. We exploit the multi-way 
Min-Max cut clustering method and provide a novel approach for cluster 
membership assignment. We further show that we can combine a cluster 
refinement method based on subspace separation with the graph clustering 
method to improve its robustness in the presence of noise. The 
proposed method yields very good performance for both synthetic and 
real image sequences. 



1 Introduction 

The matrix factorization methods proposed by Tomasi, Costeira and Kanade 
[1] [2] have been widely used for solving the motion segmentation problems 
[3] [4] [5] [6] [7] [8] and the 3D shape recovery problems [9] [10] [11]. The 
basic idea of the methods is to factorize the feature trajectory matrix into the 
motion matrix and the shape matrix, providing the separation of the feature 
point trajectories into independent motions. In this paper, we develop a novel 
robust factorization method using the techniques of spectral clustering. 

Given a set of N feature points tracked through F frames, we can construct 
a feature trajectory matrix $P \in \mathbb{R}^{2F \times N}$ where the rows correspond to the x or y 
coordinates of the feature points in the image plane and the columns correspond 
to the individual feature points. Motion segmentation algorithms based on matrix 
factorization [6] first construct a shape interaction matrix $Q$ by applying 
the singular value decomposition (SVD) to the feature trajectory matrix $P$. In 
the noise-free situation, the shape interaction matrix $Q$ can be transformed 
to a block diagonal matrix by a symmetric row and column permutation, thereby 
grouping the feature points of the same object into a diagonal block. 
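The effect of a symmetric row and column permutation can be illustrated with a small sketch. The matrix used in the usage note below is a hypothetical noise-free interaction pattern (nonzero only between points of the same object), not a shape interaction matrix computed from real trajectories, and the function names are illustrative.

```python
def permute_symmetric(q, order):
    """Apply the same permutation to the rows and columns of square matrix q."""
    return [[q[i][j] for j in order] for i in order]


def is_block_diagonal(q, sizes, tol=1e-12):
    """True if all entries outside the diagonal blocks of given sizes are ~0."""
    block = []
    for index, size in enumerate(sizes):
        block += [index] * size
    n = len(q)
    return all(abs(q[i][j]) <= tol
               for i in range(n) for j in range(n) if block[i] != block[j])
```

For example, with object labels `[0, 1, 0, 1, 0]` for five points, sorting the point indices by label gives the order `[0, 2, 4, 1, 3]`; applying it to rows and columns gathers the three points of the first object and the two points of the second into two diagonal blocks. With noisy data the off-block entries are no longer exactly zero, which is the difficulty addressed in the remainder of the paper.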

If the trajectory matrix P is contaminated by noise, however, the block diagonal 
form of Q no longer holds, and methods such as the greedy technique 
proposed in [2] tend to perform rather poorly. Recently, several methods have 
been proposed that specifically address this problem [3] [4] [5] [6] [7] [8]. We 
will give a brief review of these methods in Section 2. 

In this paper we deal with the issues related to the robustness of the fac- 
torization methods. We first show that the shape interaction matrix can be 
extracted from the trajectory matrix using QR decomposition with pivoting, 
an idea that was briefly mentioned in [2] . As a by-product we give a simple and 
clean proof of the subspace separation theorem described in [6] . We then observe 
that the shape interaction matrix is very similar to the weight matrix used for 
graph partitioning and clustering [12] [13] [14] [15], and the motion segmenta- 
tion problem can be cast as an optimal graph partitioning problem. To this end, 
we apply the spectral k-way clustering method [13] [14] to the shape interaction 
matrix to transform it into near-block diagonal form. In particular, we propose a 
novel QR decomposition based technique for cluster assignment. The technique 
at the same time also provides confidence levels of the cluster membership for 
each feature point trajectory. The confidence levels are used to provide a 
more robust cluster assignment strategy: we assign a feature point directly to a 
cluster when it has a very high confidence level for the cluster compared to those for 
other clusters. Using the assigned feature points in each cluster, we compute a 
linear subspace in the trajectory space. The cluster memberships of the other feature 
points, which have lower confidence levels and are therefore not assigned to a 
cluster, are determined by their distances to each of the linear subspaces. Our 
experiments on both synthetic data sets and real video images have shown that 
this method is very reliable for motion segmentation even in the presence of 
severe noise. 

The rest of the paper is organized as follows: Previous work is discussed 
in Section 2. Section 3 is devoted to a simple proof that the shape interaction 
matrix can be computed using QR decomposition. Motion segmentation based 
on spectral relaxation k-way clustering and subspace separation is described in 
Section 4. Experimental results are shown in Section 5 and conclusions are given in 
Section 6. 



2 Previous Work 

The factorization method was originally introduced by Tomasi and Kanade [1]. 
The method decomposes a matrix of image coordinates of N feature points 
tracked through F frames into two matrices which, respectively, represent object 
shape and camera motion. The method deals with a single static object viewed 
by a moving camera. Extending this method, Costeira and Kanade [2] proposed 
a multibody factorization method which separates and recovers the shape and 
motion of multiple independently moving objects in a sequence of images. To 
achieve this, they introduce a shape interaction matrix which is invariant to 
both the object motions and the selection of coordinate systems, and suggest a 
greedy algorithm to permute the shape interaction matrix into block diagonal 
form. Gear [3] exploited the reduced row echelon form of the shape interaction 
matrix to group the feature points into the linearly independent subspaces. For 
Gear’s method, in the noise-free case, any two columns of the echelon form which 
have nonzero elements in the same row correspond to feature points belonging to 
the same rigid body. The echelon form matrix can be represented by a weighted 
bipartite graph. Gear also used a statistical approach to estimate the grouping 
of feature points into subspaces in the presence of noise by computing which 
partition of the graph has the maximum likelihood. 

Ichimura [4] suggested a motion segmentation method based on discriminant 
criterion [16] features. The main idea of the method is to select useful features 
for grouping noisy data. Using the noise-contaminated shape interaction matrix, it 
computes a discriminant criterion for each row of the matrix. The feature points 
are then divided into two groups by the maximum discriminant criterion, and 
the corresponding row gives the best discriminant feature. The same procedure 
is applied recursively to the remaining features to extract other groups. Wu 
et al. [5] proposed an orthogonal subspace decomposition method to deal with 
noise in the shape interaction matrix. The method decomposes the 
object shape space into signal subspaces and noise subspaces. They used the 
shape signal subspace distance matrix, D, for shape space grouping rather than 
the noise-contaminated shape interaction matrix. 

Kanatani [6] [7] reformulated the motion segmentation problem based on
the idea of subspace separation. The approach is to divide the given N feature
points to form m disjoint subspaces $\mathcal{I}_i$, $i = 1, \ldots, m$. A
rather elaborate proof was given showing that, provided that the subspaces are
linearly independent, the element $Q_{ij}$ of the shape interaction matrix Q is
zero if point i and point j belong to different subspaces. Kanatani also pointed
out that even a small amount of noise in one feature point can affect all the
elements of Q in a complicated manner. Based on this fact, Kanatani proposed
noise compensation methods using the original data rather than the shape
interaction matrix Q.

Zelnik-Manor and Irani [8] showed that different 3D motions can also be
captured as a single object by previous methods when there is a partial
dependency between the objects. To solve the problem, they suggested using an
affinity matrix $Q$ where $Q_{ij} = \exp(v_k(i) - v_k(j))^2$, where the $v_k$
are the eigenvectors of Q with the largest eigenvalues. They also dealt with
the multi-sequence factorization problem for temporal synchronization using
multiple video sequences of the same dynamic scene.



3 Constructing the Shape Interaction Matrix Using QR 
Decomposition 

In this section, we exhibit the block diagonal form of the shape interaction
matrix using QR decomposition with pivoting [17]; this also provides a simpler
proof of the shape subspace separation theorem (Theorem 1 in [6]). Assume we
have N rigidly moving feature points, $p_1, \ldots, p_N$, which are the
image-plane trajectories of the corresponding 3D points over the F frames.
Motion segmentation can be interpreted as dividing the feature points $p_i$
into S groups [6], each spanning a linear subspace corresponding to feature
points belonging to the same object. We denote the grouping as follows,

$$\{1, \ldots, N\} = \bigcup_{i=1}^{S} \mathcal{I}_i, \qquad \mathcal{I}_i \cap \mathcal{I}_j = \emptyset \quad (i \ne j).$$

Now define $l_i = |\mathcal{I}_i|$, which is the number of points in the set
$\mathcal{I}_i$, $k_i = \dim \operatorname{span}\{p_j\}_{j \in \mathcal{I}_i} \le l_i$,
and $P_i = \{p_j\}_{j \in \mathcal{I}_i}$.

Let the SVD of $P_i$ be $P_i = U_i \Sigma_i V_i^T$, where
$\Sigma_i \in \mathbb{R}^{k_i \times k_i}$, $i = 1, \ldots, S$. Then
$P = [P_1, P_2, \ldots, P_S]$ can be written as



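$$P = [P_1, P_2, \ldots, P_S] = [U_1\Sigma_1, U_2\Sigma_2, \ldots, U_S\Sigma_S]
\begin{bmatrix}
V_1^T & 0 & \cdots & 0 \\
0 & V_2^T & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & V_S^T
\end{bmatrix} \qquad (1)$$

where $\operatorname{rank}(V_i) = k_i$ for $i = 1, \ldots, S$. We assume the S
subspaces $\operatorname{span}\{p_j\}_{j \in \mathcal{I}_i}$, $i = 1, \ldots, S$,
are linearly independent; then the matrix
$[U_1\Sigma_1, U_2\Sigma_2, \ldots, U_S\Sigma_S]$ has full column rank
$k = k_1 + \cdots + k_S$. Therefore, an arbitrary orthonormal basis for the row
space of P can be written as $\Phi \operatorname{diag}(V_1, \ldots, V_S)^T$ for
an arbitrary orthogonal matrix $\Phi \in \mathbb{R}^{k \times k}$. Now the shape
interaction matrix can be written as

$$Q = \operatorname{diag}(V_1, \ldots, V_S)\, \Phi^T \Phi\, \operatorname{diag}(V_1, \ldots, V_S)^T = \operatorname{diag}(V_1 V_1^T, \ldots, V_S V_S^T).$$

This clearly shows that $Q_{ij} = 0$ if i and j belong to different subspaces,
i.e., if the corresponding feature points belong to different objects.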
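The block diagonal structure can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes each object's tracks are generated by a hypothetical affine camera model (so each trajectory block has rank 4), and the helper name `rigid_object_tracks` and all sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def rigid_object_tracks(n_points, n_frames, rng):
    """Trajectory matrix (2*n_frames x n_points) of one object: each column is
    a point track, built as (stacked 2x4 affine cameras) @ (homogeneous shape)."""
    shape = np.vstack([rng.standard_normal((3, n_points)), np.ones((1, n_points))])
    cameras = rng.standard_normal((2 * n_frames, 4))
    return cameras @ shape

n1, n2, n_frames = 6, 8, 10
P = np.hstack([rigid_object_tracks(n1, n_frames, rng),
               rigid_object_tracks(n2, n_frames, rng)])

k = np.linalg.matrix_rank(P)             # k = k_1 + k_2 (4 + 4 for generic data)
_, _, Vt = np.linalg.svd(P, full_matrices=False)
V = Vt[:k].T                             # orthonormal basis of the row space of P
Q = V @ V.T                              # shape interaction matrix (N x N)

cross_block = np.abs(Q[:n1, n1:]).max()  # entries linking the two objects
within_block = np.abs(Q[:n1, :n1]).max()
print(f"k = {k}, max |Q_ij| across objects = {cross_block:.2e}, "
      f"within first object = {within_block:.2f}")
```

With noise-free data the cross-object entries vanish up to rounding error, while the entries within an object do not.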

A cheaper way to compute an orthonormal basis for the row space of P than
using the SVD is to apply QR decomposition with column pivoting to $P^T$,

$$P^T E = \hat{Q} R \qquad (2)$$

where $E$ is a permutation matrix and $\hat{Q}$ has k columns. It is easy to see
that $\hat{Q}\hat{Q}^T = Q$. In the presence of noise, P will not have rank
exactly k, but QR decomposition with column pivoting will in general generate
an R matrix that reliably reveals the numerical rank of P. We can truncate R by
deleting rows with small entries.
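As a rough illustration of this alternative, the sketch below builds an orthonormal basis of the row space of P by a column-pivoted Gram-Schmidt process applied to the transpose of P, and compares the resulting shape interaction matrix with the SVD-based one. The function name, the tolerance `tol`, and the random rank-4 test blocks are assumptions made for the example; the Gram-Schmidt loop stands in for a library pivoted-QR routine.

```python
import numpy as np

def pivoted_orthonormal_basis(A, tol=1e-10):
    """Column-pivoted (modified) Gram-Schmidt on A.
    Returns (basis, rank): orthonormal columns spanning range(A), and the
    number of pivots whose residual norm exceeds tol * (largest column norm)."""
    residual = np.array(A, dtype=float)
    scale = np.linalg.norm(residual, axis=0).max()
    basis = []
    for _ in range(min(residual.shape)):
        norms = np.linalg.norm(residual, axis=0)
        j = int(np.argmax(norms))              # pivot: largest remaining column
        if norms[j] <= tol * scale:            # the rest is numerically zero
            break
        q = residual[:, j] / norms[j]
        basis.append(q)
        residual -= np.outer(q, q @ residual)  # deflate every column against q
    return np.column_stack(basis), len(basis)

rng = np.random.default_rng(1)
# Two hypothetical rank-4 blocks standing in for two independently moving objects.
P = np.hstack([rng.standard_normal((20, 4)) @ rng.standard_normal((4, 6)),
               rng.standard_normal((20, 4)) @ rng.standard_normal((4, 8))])

Q_hat, k = pivoted_orthonormal_basis(P.T)      # basis of range(P^T) = row space of P
Q_qr = Q_hat @ Q_hat.T

_, _, Vt = np.linalg.svd(P, full_matrices=False)
Q_svd = Vt[:k].T @ Vt[:k]

print(k, np.abs(Q_qr - Q_svd).max())
```

Both routes yield the same projector onto the row space of P, and the point at which the pivot norms collapse gives the numerical rank. In practice a library routine such as SciPy's `scipy.linalg.qr` with `pivoting=True` would normally be preferred over a hand-written loop.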
4 Motion Segmentation



4.1 Spectral Multi-way Clustering 

In the last section, we have shown that the shape interaction matrix,
$Q \in \mathbb{R}^{N \times N}$, has the block diagonal form when the feature
points are grouped into independent subspaces corresponding to S different
objects. In general, this grouping is unknown, and we need to find row and
column permutations of the matrix Q to exhibit this block diagonal form,
thereby assigning the feature points to different objects. A greedy algorithm
has been proposed in [2] for this problem, but it performs poorly in the
presence of noise. We now present a more robust method based on spectral graph
clustering [12] [13] [14] [15]. We propose a novel technique for cluster
assignment in spectral clustering and show that it provides a confidence level
that can be used for further refining the cluster memberships of the feature
points, thus improving the robustness of the spectral clustering method.

We consider the absolute value of the (i, j) element of the shape interaction
matrix Q as a measure of the similarity of feature points i and j, with feature
points belonging to the same object being more similar than feature points
belonging to different objects. In fact, in the noise-free case, feature points
in different objects will have zero similarity. Our goal is then to partition
the feature points into S groups so that feature points are more similar within
each group than across different groups. Let $W = (w_{ij})$ with
$w_{ij} = |Q_{ij}|$. For a given partition of the feature points into S groups,
we can permute the rows and columns of W so that rows and columns corresponding
to the feature points belonging to the same objects are adjacent to each other,
i.e., we can re-order the columns and rows of the W matrix accordingly such that

$$W = \begin{bmatrix}
W_{11} & W_{12} & \cdots & W_{1S} \\
W_{21} & W_{22} & \cdots & W_{2S} \\
\vdots & \vdots & \ddots & \vdots \\
W_{S1} & W_{S2} & \cdots & W_{SS}
\end{bmatrix} \qquad (3)$$



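We want to find a partition such that $W_{ii}$ will be large while $W_{ij}$,
$i \ne j$, will be small. To measure the size of a sub-matrix $W_{ij}$ we use
the sum of all its elements, denoted as $\operatorname{sum}(W_{ij})$. Let $x_i$
be a cluster indicator vector, partitioned conformally with W, with all
elements equal to zero except those corresponding to the rows of $W_{ii}$,

$$x_i = [0 \cdots 0, 1 \cdots 1, 0 \cdots 0]^T.$$

Denote $D = \operatorname{diag}(D_1, D_2, \ldots, D_S)$, where $D_i$ is the
diagonal matrix holding the row sums of the i-th block row of W,
$D_i = \operatorname{diag}\big(\sum_{j=1}^{S} W_{ij} \mathbf{1}\big)$. It is
easy to see that

$$\operatorname{sum}(W_{ii}) = x_i^T W x_i, \qquad \sum_{j \ne i} \operatorname{sum}(W_{ij}) = x_i^T (D - W) x_i.$$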

Since we want to find a partition which will maximize
$\operatorname{sum}(W_{ii})$ while minimizing $\operatorname{sum}(W_{ij})$,
$i \ne j$, we seek to minimize the following objective function by finding a
set of indicator vectors $x_i$. The objective function is called min-max cut
in [13] [14], which is a generalization of the normalized cut objective
function [12] to the multi-way partition case.

$$\mathrm{MCut} = \frac{x_1^T (D - W) x_1}{x_1^T W x_1} + \frac{x_2^T (D - W) x_2}{x_2^T W x_2} + \cdots + \frac{x_S^T (D - W) x_S}{x_S^T W x_S}
= \frac{x_1^T D x_1}{x_1^T W x_1} + \frac{x_2^T D x_2}{x_2^T W x_2} + \cdots + \frac{x_S^T D x_S}{x_S^T W x_S} - S.$$



If we define $y_i = D^{1/2} x_i / \|D^{1/2} x_i\|_2$ and
$Y_S = [y_1, \ldots, y_S]$, we have

$$\mathrm{MCut} = \frac{1}{y_1^T \hat{W} y_1} + \frac{1}{y_2^T \hat{W} y_2} + \cdots + \frac{1}{y_S^T \hat{W} y_S} - S \qquad (4)$$

where $\hat{W} = D^{-1/2} W D^{-1/2}$ and
$y_i = D^{1/2} x_i / \|D^{1/2} x_i\|_2$. It is easy to see that the $y_i$
are orthogonal to each other and normalized to have Euclidean norm one. If we
insist that the $y_i$ be constrained to inherit the discrete structure of the
indicator vectors $x_i$, then we are led to a combinatorial optimization
problem which has been proved to be NP-hard even when S = 2 [12]. The idea of
spectral clustering is instead to relax this constraint and allow the $y_i$ to
be an arbitrary set of orthonormal vectors. In this case, the minimum of Eq. 4
can be shown to be achieved by an orthonormal basis $y_1, \ldots, y_S$ of the
subspace spanned by the eigenvectors corresponding to the largest S eigenvalues
of $\hat{W}$. Next we discuss how to assign the feature points to clusters
based on the eigenvectors.
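The relaxed problem can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function name `relaxed_indicators` and the toy block diagonal matrix are assumptions, and every row sum of W is assumed to be strictly positive so that the scaling by the inverse square root of D is defined.

```python
import numpy as np

def relaxed_indicators(Q, S):
    """Rows of the returned Y are orthonormal eigenvectors belonging to the S
    largest eigenvalues of W_hat = D^{-1/2} |Q| D^{-1/2}."""
    W = np.abs(Q)
    d = W.sum(axis=1)                          # diagonal of D; assumed > 0
    s = 1.0 / np.sqrt(d)
    W_hat = s[:, None] * W * s[None, :]        # D^{-1/2} W D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(W_hat)   # eigenvalues in ascending order
    return eigvecs[:, -S:].T, eigvals[-S:]

# Toy noise-free similarity with two groups of 3 and 4 points.
Q = np.zeros((7, 7))
Q[:3, :3] = 1.0 / 3.0
Q[3:, 3:] = 1.0 / 4.0

Y, top_eigenvalues = relaxed_indicators(Q, S=2)
print(top_eigenvalues)
print(np.round(Y @ Y.T, 6))
```

In this noise-free toy case the two leading eigenvalues coincide, so the eigenvectors are determined only up to a rotation within their subspace. This is why a separate cluster assignment step is needed.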

We should first mention that the cluster assignment problem in spectral
clustering is not yet well understood. Here we follow the approach proposed
in [15]. Denote $Y = [y_1, \ldots, y_S]^T$ as the optimal solution of Eq. 4.
The vectors $y_i$ can be used for cluster assignment because
$y_i \approx D^{1/2} x_i / \|D^{1/2} x_i\|_2$, where $x_i$ is the cluster
indicator vector of the i-th cluster. Ideally, if W is partitioned perfectly
into S clusters, then the columns of $X = [x_1, \ldots, x_S]^T$ belonging to the
i-th cluster are all the same: one in the i-th row and zeros elsewhere. Two
columns from different clusters are orthogonal to each other. This property is
approximately inherited by Y: two columns from two different clusters are
orthogonal to each other, and those from one cluster are the same. We now pick
the column of Y which has the largest norm, say it belongs to cluster i, and
orthogonalize the rest of the columns of Y against this column. We assign to
cluster i the columns whose residual is small. We then perform this process S
times. As discussed in [15], this is exactly the procedure of QR decomposition
with column pivoting applied to Y. In particular, we compute the QR
decomposition of $Y$ with column pivoting,

$$Y E = QR = Q[R_{11}, R_{12}]$$

where Q is an S × S orthogonal matrix, $R_{11}$ is an S × S upper triangular
matrix, and E is a permutation matrix. Then we compute a matrix $\hat{R}$ as

$$\hat{R} = R_{11}^{-1} [R_{11}, R_{12}] E^T = [I_S, R_{11}^{-1} R_{12}] E^T.$$

The matrix $\hat{R} \in \mathbb{R}^{S \times N}$ can be considered as giving the
levels of confidence of a point to be assigned to each cluster. Notice that the
columns correspond to the feature points and the rows correspond to the
clusters. The cluster membership of each feature point is determined by the row
index of the largest element in absolute value of the corresponding column of
$\hat{R}$. This provides us with a baseline spectral clustering method for
motion segmentation which is quite robust in the presence of noise. Further
improvement can be achieved as we discuss next.
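The assignment step can be sketched as follows, under stated assumptions. The pivots are chosen by the greedy procedure described above (largest column norm, then orthogonalize the remaining columns). Because the pivot columns of Y equal the product of the orthogonal factor and the leading triangular block, the confidence matrix (called `R_hat` in the code) can be obtained by solving a linear system with those pivot columns. The function name and the rotated ideal indicators used as input are illustrative, not taken from the paper.

```python
import numpy as np

def confidence_matrix(Y):
    """Confidence matrix for an (S x N) matrix Y whose columns are points.
    Pivots are picked greedily, as in QR with column pivoting; the result
    equals R11^{-1} [R11, R12] expressed in the original column order."""
    S, _ = Y.shape
    residual = Y.astype(float).copy()
    pivots = []
    for _ in range(S):
        j = int(np.argmax(np.linalg.norm(residual, axis=0)))
        pivots.append(j)
        q = residual[:, j] / np.linalg.norm(residual[:, j])
        residual -= np.outer(q, q @ residual)   # orthogonalize all columns against q
    # Y[:, pivots] = Q @ R11, hence R_hat = (Y[:, pivots])^{-1} @ Y.
    return np.linalg.solve(Y[:, pivots], Y), pivots

# Ideal indicators for clusters of sizes 3 and 4, rotated by an arbitrary
# orthogonal matrix (eigenvectors are only determined up to such a rotation).
Y0 = np.array([[1, 1, 1, 0, 0, 0, 0],
               [0, 0, 0, 1, 1, 1, 1]], dtype=float)
Y0 /= np.linalg.norm(Y0, axis=1, keepdims=True)
theta = 0.7
Phi = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
Y = Phi @ Y0

R_hat, pivots = confidence_matrix(Y)
labels = np.abs(R_hat).argmax(axis=0)
print(np.round(R_hat, 3))
print(labels)
```

For this ideal input the rotation is undone: every column of the confidence matrix has a single dominant entry of magnitude one, and the two groups receive different labels.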

We can assign a point to a cluster with high confidence if there is a clearly
dominant confidence value in the corresponding column; however, we are not able
to do this if two or more values in a column are very close to each other.
Table 1 shows an example of the matrix $\hat{R} \in \mathbb{R}^{3 \times 10}$
for 10 points extracted from 3 objects. The last row of the table shows the
cluster membership of each point assigned by the row index of the highest
absolute value. For instance, the point p1 is assigned to cluster 2 because the
second row value (0.329) is greater than the other row values (0.316 and
0.203). However, we cannot have much confidence in its membership because there
is no dominant value in the corresponding column.



Table 1. An example of the matrix R. There are 10 points extracted from 3 objects.
The last row shows the assigned cluster.

Cluster ID          p1     p2     p3     p4     p5     p6     p7     p8     p9     p10
k = 1             0.316  0.351  0.876  0.331  0.456  0.562  0.086  0.275  0.072  0.119
k = 2             0.329  0.338  0.032  0.372  0.013  0.060  0.186  0.706  0.815  0.831
k = 3             0.203  0.017  0.031  0.173  0.566  0.556  0.775  0.126  0.094  0.113
Assigned Cluster    2      1      1      2      3      1      3      2      2      2
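The hard assignment in the last row of Table 1 can be reproduced directly from the tabulated values. The short sketch below also computes, for each point, the gap between the two largest confidence values; this "margin" is an illustrative quantity introduced here to make the notion of a dominant value concrete, and is not a measure defined in the paper.

```python
import numpy as np

# Confidence values copied from Table 1 (rows: k = 1, 2, 3; columns: p1 ... p10).
R_hat = np.array([
    [0.316, 0.351, 0.876, 0.331, 0.456, 0.562, 0.086, 0.275, 0.072, 0.119],
    [0.329, 0.338, 0.032, 0.372, 0.013, 0.060, 0.186, 0.706, 0.815, 0.831],
    [0.203, 0.017, 0.031, 0.173, 0.566, 0.556, 0.775, 0.126, 0.094, 0.113],
])

conf = np.abs(R_hat)
assigned = conf.argmax(axis=0) + 1          # row index (1-based) of the largest value

ordered = np.sort(conf, axis=0)             # ascending within each column
margin = ordered[-1] - ordered[-2]          # gap between the best and second best

for j, (a, m) in enumerate(zip(assigned, margin), start=1):
    print(f"p{j}: cluster {a}, margin {m:.3f}")
```

Points such as p1, p2 and p6 receive a label even though their two best values are nearly tied, which is the situation the refinement of Section 4.2 is meant to handle.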



4.2 Refinement of Cluster Assignment for Motion Segmentation 

The baseline spectral clustering is robust in a noisy environment in spite of
its hard clustering (it assigns each point to a cluster even when it does not
have high confidence in the assignment). The method alone, however, can
sometimes fail in the presence of severe noise. In this section, we discuss a
two-phase approach whereby in phase one we assign the cluster memberships for
those feature points with high confidence levels, and in phase two we construct
linear subspaces for each cluster based on the high-confidence feature points,
and assign the rest of the feature points by projecting them onto these
subspaces.

Fig. 1. Two synthetic video sequences used in [7] and [18], respectively. (a) 9 red
dots are foreground points and 20 green dots are background points. (b) 24
background points and 14 foreground points. The foreground points are connected
with lines.

Our approach proceeds as follows. After computing $\hat{R}$ as discussed in the
previous section, the points of high confidence for each cluster are selected.
Let us define $P_i = [p_{i1}, \ldots, p_{iN_i}]$ as the trajectory points in
cluster i. One of the easiest methods is to apply a threshold to the values of
each column: if the highest value in the column is greater than the threshold,
the point is assigned to the corresponding cluster; if it is not, we assign the
point to cluster 0, meaning that its cluster membership is temporarily pending.
Let us define the pending points as $P_0 = [p_{01}, \ldots, p_{0N_0}]$. The
next step is to compute a 2D subspace for $p_{i1}, \ldots, p_{iN_i}$,
$i = 1, \ldots, S$, using Principal Component Analysis (PCA). Let us denote by
$U_i$ a subspace basis for cluster i. We finally determine the cluster
membership of each pending point by computing the minimum distance from the
point to the subspaces,

$$\theta_j = \arg\min_i \left\| p_{0j} - \left(c_i + U_i U_i^T (p_{0j} - c_i)\right) \right\|_2,$$

where $j = 1, \ldots, N_0$ and $c_i = \frac{1}{N_i} \sum_{j=1}^{N_i} p_{ij}$.

The point $p_{0j}$ is assigned to the cluster $\theta_j$.
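The two phases can be sketched as below. This is a sketch under assumptions, not the authors' implementation: the function name `refine`, the threshold value, and the toy data (points on random 2D affine subspaces of a 6-dimensional space) are all hypothetical, and each cluster is assumed to keep enough high-confidence points to estimate its subspace.

```python
import numpy as np

def refine(P, R_hat, threshold, dim=2):
    """P: (n_features, N) trajectory matrix whose columns are points.
    R_hat: (S, N) confidence matrix. Returns labels in 1..S."""
    conf = np.abs(R_hat)
    # Phase one: keep only confident assignments; 0 marks a pending point.
    labels = np.where(conf.max(axis=0) > threshold, conf.argmax(axis=0) + 1, 0)
    # Phase two: PCA subspace (centroid c_i, basis U_i) per cluster.
    models = []
    for i in range(1, R_hat.shape[0] + 1):
        Pi = P[:, labels == i]
        c = Pi.mean(axis=1, keepdims=True)
        U, _, _ = np.linalg.svd(Pi - c, full_matrices=False)
        models.append((c, U[:, :dim]))
    # Assign each pending point to the subspace with the smallest residual.
    for j in np.flatnonzero(labels == 0):
        p = P[:, [j]]
        residuals = [np.linalg.norm((p - c) - U @ (U.T @ (p - c))) for c, U in models]
        labels[j] = int(np.argmin(residuals)) + 1
    return labels

rng = np.random.default_rng(2)

def affine_cluster(n):
    """n points on a random 2D affine subspace of R^6 (toy stand-in for tracks)."""
    c, B = rng.standard_normal((6, 1)), rng.standard_normal((6, 2))
    return c + B @ rng.standard_normal((2, n))

P = np.hstack([affine_cluster(5), affine_cluster(5)])
# Hypothetical confidences: the 5th and 10th points are ambiguous, and the 5th
# would even be mislabeled by a plain argmax.
R_hat = np.array([[0.9, 0.9, 0.9, 0.9, 0.50, 0.1, 0.1, 0.1, 0.1, 0.50],
                  [0.1, 0.1, 0.1, 0.1, 0.52, 0.9, 0.9, 0.9, 0.9, 0.45]])

labels = refine(P, R_hat, threshold=0.7)
print(labels)
```

In this toy case the ambiguous points are recovered correctly because their residual with respect to the right subspace is essentially zero.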

5 Experimental Results 

Figure 1 shows two synthetic image sequences used for performance evaluation.
These sequences were used in [7] and [18]. Figure 1(a), denoted as Synthetic
1, has 20 background points (green dots) and 9 foreground points (red dots), and
Figure 1(b), denoted as Synthetic 2, has 20 background points and 14 foreground
points. The foreground points are connected by lines for visualization purposes.

We performed experiments using not only the original tracking data but also
data with independent Gaussian noise of mean 0 and standard deviation σ added
to the coordinates of all the points. For the noisy data, we generate 5 sets for
each σ = 1, 2, 3, 4, and compute the misclassification rate by simply averaging
the 5 experiment results. We compare the two methods proposed in this paper (one
is the k-way Min-Max cut clustering of Sec. 4.1, denoted as Method 1, and the
other is a combination of the k-way Min-Max cut clustering and the clustering
refinement using subspace projection of Sec. 4.2, denoted as Method 2) to the
multi-stage optimization proposed in [18], denoted as Multi-Stage. Table 2
shows the misclassification rates of the three methods over the different noise
levels (σ = 0, 1, 2, 3, 4). Method 2 and Multi-Stage yield better performance
than Method 1; these two methods perform almost perfectly on both sequences.

Table 2. Misclassification rate (%) for two synthetic sequences. The values in
parentheses are standard deviations. Method 1 is the k-way Min-Max cut
clustering in Sec. 4.1 and Method 2 is the k-way Min-Max cut clustering +
clustering refinement using subspace projection in Sec. 4.2. Multi-Stage is the
multi-stage optimization proposed in [18].

Video Sequence  Method / noise  σ = 0   σ = 1       σ = 2       σ = 3       σ = 4
Synthetic 1     Method 1        0.0     1.4         1.4         0.7         0.7
                Method 2        0.0     0.0         0.0         0.0         0.0
                Multi-Stage     0.0     0.0         0.7         0.0         0.0
Synthetic 2     Method 1        8.2     10.6 (1.6)  11.7 (2.1)  11.7 (3.6)  13.2 (1.7)
                Method 2        0.0     0.0         0.0         0.0         0.59 (1.3)
                Multi-Stage     0.0     0.0         0.0         0.0         0.59 (1.3)



We experimented with the real video sequences used in [18]. In all the
sequences, one object is moving while the background is simultaneously moving
because of the camera motion. We denote the video sequences as video1, video2
and video3, respectively. We synthesize one more test video sequence by
overlaying the foreground feature points of video1 onto video2, which yields 2
moving objects and background. We denote this video sequence as video4. Figure 2
shows 5 selected frames of the four sequences.

We also performed experiments using not only the original tracking data but
also data with independent Gaussian noise of mean 0 and standard deviation σ
added to the coordinates of all the points. For the noisy data, we generate
5 sets for each σ = 3, 5, 7, 10, and compute the misclassification rate by
simply averaging the 5 experiment results.
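The evaluation protocol can be sketched as follows. Several details here are assumptions rather than statements from the paper: predicted cluster labels are matched to the ground truth by trying every relabeling (the text does not say how labels are matched), the reported spread is a population standard deviation, and `segment` is a placeholder for whichever segmentation method is being tested.

```python
import itertools
import numpy as np

def misclassification_rate(pred, truth):
    """Percentage of wrongly labeled points, minimized over relabelings of pred."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    ids = np.unique(truth)
    best = len(truth)
    for perm in itertools.permutations(ids):
        mapping = dict(zip(ids.tolist(), perm))
        mapped = np.array([mapping.get(int(p), -1) for p in pred])
        best = min(best, int((mapped != truth).sum()))
    return 100.0 * best / len(truth)

def noisy_trials(P, truth, segment, sigma, n_trials=5, seed=0):
    """Mean and standard deviation of the error over n_trials noisy copies of P."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_trials):
        noisy = P + sigma * rng.standard_normal(P.shape)   # i.i.d. N(0, sigma^2)
        rates.append(misclassification_rate(segment(noisy), truth))
    return float(np.mean(rates)), float(np.std(rates))

example_rate = misclassification_rate([2, 2, 1, 1, 1], [1, 1, 2, 2, 1])
print(example_rate)
```

Trying all relabelings is only practical for a handful of clusters, which is the case for the sequences considered here.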

Table 3 shows the misclassification rates of the three methods over the
different noise levels (σ = 0, 3, 5, 7, 10). The table shows that Method 2
classifies the motions perfectly or nearly perfectly even in the presence of
severe noise (its only errors are 0.7% for video4 at σ = 10). It is very robust
and stable to noise. Method 1 performs very well in the noise-free environment,
but it misclassifies some points in the presence of noise.

Multi-Stage performs very well for video1 through video3, which have one
moving foreground object and background. It does not, however, yield good
performance in the presence of noise for video4, which has two moving
foreground objects and background. Based on our experiments, the method also
suffers from a local minima problem: using the same data, it yields different
results depending on the initialization. That is the reason the standard
deviation of the method shown in Table 3 is so high.

Fig. 2. Real video sequences with the feature points. 1st row: video1, 2nd row:
video2, 3rd row: video3, 4th row: video4 (foreground feature points in video1
are overlaid in video4). Red dots correspond to background while green dots
correspond to foreground. The yellow cross marks in video4 represent the
foreground feature points of video1.

Table 3. Misclassification rate (%) for the real video sequences. The values in
parentheses are standard deviations.

Video Sequence  Method / noise  σ = 0   σ = 3       σ = 5       σ = 7        σ = 10
video1          Method 1        0.0     0.0         0.0         0.0          0.0
                Method 2        0.0     0.0         0.0         0.0          0.0
                Multi-Stage     0.0     0.0         0.0         0.0          0.0
video2          Method 1        0.0     1.6 (1.2)   1.6 (1.2)   1.6 (1.2)    2.9 (1.7)
                Method 2        0.0     0.0         0.0         0.0          0.0
                Multi-Stage     0.0     0.0         0.0         0.0          7.3 (16.3)
video3          Method 1        0.0     2.5 (0.01)  2.5         1.3          2.5
                Method 2        0.0     0.0         0.0         0.0          0.0
                Multi-Stage     0.0     0.0         0.0         0.0          0.0
video4          Method 1        0.0     0.7 (1.6)   3.4 (4.7)   8.3 (5.0)    9.6 (6.5)
                Method 2        0.0     0.0         0.0         0.0          0.7 (1.6)
                Multi-Stage     0.0     4.1 (9.3)   8.2 (9.6)   16.2 (13.2)  19.23 (9.87)

Fig. 3. Graph for misclassification rate. The graph of video1 is not depicted
here because all three methods perform perfectly. Method 1: dash-dot blue line,
Method 2: red line, and Multi-Stage: dashed green line.

6 Conclusions

In this paper, we prove that the shape interaction matrix can be computed using
QR decomposition, which is cheaper than SVD. We solve the motion segmentation
problem using a spectral graph clustering technique because the shape
interaction matrix has a very similar form to the weight matrix of a graph. We
apply the spectral relaxation K-way Min-Max cut clustering method [13] [14] to
the shape interaction matrix. It provides a relaxed cluster indication matrix.
QR decomposition is applied to this matrix to generate a new cluster indication
matrix, which determines the cluster membership of each point. The values of
the new cluster indication matrix reflect the confidence level for each point
to be assigned to the clusters. This method yields good performance in a
noise-free environment, but it is sometimes sensitive to noise. We propose a
robust motion segmentation method that combines spectral graph clustering and
subspace separation to compensate for the noise problem. Initially, we assign
only points of high confidence to clusters based on the cluster indication
matrix. We compute a subspace for each cluster using the assigned points. We
finally determine the membership of the other points, which are not assigned to
a cluster, by computing the minimum residual when they are projected onto the
subspaces.

We applied the proposed method to two synthetic image sequences and four
real video sequences. Method 2 and Multi-Stage produce almost perfect
performance for the synthetic image sequences in the presence of noise.
Experiments also show that the proposed method, Method 2, performs very well
for the real video sequences even in the presence of severe noise. It performs
better than the Multi-Stage optimization method [18] for real video sequences
in which there are more than two objects.

Learning Outdoor Color Classification from Just
One Training Image



Roberto Manduchi 
University of California, Santa Cruz 



Abstract. We present an algorithm for color classification with explicit
illuminant estimation and compensation. A Gaussian classifier is trained
with color samples from just one training image. Then, using a simple
diagonal illumination model, the illuminants in a new scene that
contains some of the same surface classes are estimated in a Maximum
Likelihood framework using the Expectation Maximization algorithm.
We also show how to impose priors on the illuminants, effectively
computing a Maximum-A-Posteriori estimation. Experimental results show
the excellent performance of our classification algorithm for outdoor
images.^



1 Introduction 

Recognition (or, more generally, classification) is a fundamental task in computer 
vision. Unlike clustering/segmentation, the classification process relies
on prior information, in the form of physical modeling and/or of training data, 
to assign labels to images or image areas. This paper is concerned with the 
classification of outdoor scenes based on color. Color features are normally used 
in such different domains as robotics, image database indexing, remote sensing, 
tracking, and biometrics. Color vectors are generated directly by the sensor for 
each pixel, as opposed to other features, such as texture or optical flow, which 
require possibly complex pre-processing. In addition, color information can be 
exploited at the local level, enabling simple classifiers that do not need to worry 
too much about contextual spatial information. 

Color-based classification relies on the fact that a surface type is often 
uniquely characterized by its reflectance spectrum: different surface types usu- 
ally have rather different reflectance characteristics. Unfortunately, a camera 
does not take direct reflectance measurements. Even neglecting specular and 
non-Lambertian components, the spectral distribution of the radiance from a 
surface is a function of the illuminant spectrum (or spectra) as much as of the 
surface reflectance. The illuminant spectrum is in this context a nuisance pa- 
rameter, inducing an undesired degree of randomness to the perceived color of 
a surface. Unless one is interested solely in surfaces with a highly distinctive 
reflectance (such as the bright targets often used in laboratory robotics exper- 
iments), ambiguity and therefore misclassification will arise when surfaces are 
illuminated by “unfamiliar” light. 

^ This work was supported by DARPA through subcontract 1235249 from JPL.

One way to reduce the dependence on the illuminant is to use more training
data and sophisticated statistical models to represent the color variability. This is 
feasible if the space of possible illuminants is not too broad, meaning that we can 
hope to sample it adequately. For example, in the case of outdoor scenes (which 
are of interest to this work), the spectrum of the illuminant (direct sunlight or 
diffuse light and shade) can be well modeled by a low-dimensional linear space. 
Thus, by collecting many image samples of the surfaces of interest under all ex- 
pected light conditions, one may derive the complete statistical distribution of 
colors within each class considered, and therefore build a Bayesian classifier that 
can effectively cope with variation of illumination. This approach was taken by 
the author and colleagues for the design and implementation of the color-based 
terrain typing subsystem of the experimental Unmanned Vehicle (XUV) DEMO 
III [5], which provided excellent classification performance. Unfortunately,
collecting and hand-labeling extensive training data sets may be difficult, time-
consuming, and impractical or impossible in many real-world scenarios. This 
prompted us to study an orthogonal approach, relying on a model-based, rather 
than exemplar-based, description of the data. Our algorithm aims to decou- 
ple the contribution of the reflectance and of the illumination components to 
the color distribution within each surface class, and to explicitly recover and 
compensate for variations of the illuminant (or illuminants) in the scene. Both 
components (reflectance and illuminant) are modeled by suitable (and simple) 
statistical distributions. The contribution of reflectance to the color distribution 
is learned by observing each class under just one “canonical” illuminant, possi- 
bly within a single training image. To model the contribution of illumination, 
one may either directly hard-code existing chromaticity daylight curves [4] into 
the system, or learn the relevant parameters from a data set of observations of 
a fixed target (such as a color chart) under a wide variety of illumination con- 
ditions. Note that illuminant priors learning is performed once and for all, even 
before choosing the classes of interest. 

The estimation of the illuminants present in the scene, together with the de- 
termination of which illuminant impinges on each surface element, is performed 
by a Maximum-A-Posteriori (MAP) algorithm based on the distributions es- 
timated in the training phase. Our formulation of the MAP criterion is very 
similar to the one by Tsin et al. [3]. Our work, however, differs from [3] in two 
main aspects. Firstly, [3] requires that a number of images of the same scene, 
containing the surface types of interest, are collected by a fixed camera under 
variable lighting conditions. While this training procedure may be feasible for 
surveillance systems with still cameras, it is impractical for other applications 
(such as robotics). As mentioned earlier, our system only requires one image 
containing the surfaces of interest under a single illuminant. Secondly, our algo- 
rithm is conceptually and computationally simpler than [3] . Instead of an ad-hoc 
procedure, we rely on the well-understood Expectation Maximization algorithm 
for illuminant parameter estimation. A simple modification of the EM algorithm 
allows us to include the prior distribution of the illuminant parameters for a truly 
Bayesian estimation. Illuminant priors are critical when dealing with scenes
containing surfaces that were not used during training. Without prior knowledge of
the actual statistics of illuminant parameters, the system would be tempted to
“explain too much” of the scene, that is, to model the color of a never seen before 
surface as the transformation of a known reflectance under a very unlikely light. 
To further shield the algorithm from the influence of “outlier” surfaces, we also 
augment the set of classes with a non-informative class distribution, a standard 
procedure in similar cases. The price to pay for the simplicity of our algorithm is 
a less accurate model of color production than in [3], which potentially may lead 
to lower accuracy in the illuminant compensation process. We use the diagonal 
model [8] to relate the variation of the illuminant spectrum to the perceived 
color. It is well known that a single diagonal color transformation cannot, in 
general, accurately predict the new colors of different surface types. However, 
we argue that the computational advantages of using such a simple model largely 
offset the possibly inaccurate color prediction. Note that other researchers have 
used the diagonal color transformation for classification purposes (e.g. [12]). 

2 The Algorithm 

Assume that K surface classes of interest have been identified, and that training
has been performed over one or more images, where all samples used for training
are illuminated by the same illuminant. Let $p(c \mid k)$ denote the conditional
likelihood over colors c for the class model k, as estimated from the training
data. The total likelihood of color c is thus

$$p(c) = \sum_{k=1}^{K} P_K(k)\, p(c \mid k) \qquad (1)$$

where $P_K(k)$ is the prior probability of surface class k. In general, a scene
to be classified contains a number of surfaces, some (but not all) of which
belong to the set of classes used for training, and are illuminated by one or
more illuminants which may be different from the illuminant used for training.
Assume there are L possible illuminant types in the scene. Let c be the color of
the pixel which is the projection of a certain surface patch under illuminant
type l. We will denote by $F_l(c)$ the operator that transforms c into the color
that would be seen if the illuminant of type l had the same spectrum as the one
used for training, all other conditions being the same (remember that only one
illuminant is used for training). Then, one may compute the conditional
likelihood of a color c in a test image given surface class k, illuminant type
l, and transformation $F_l$:

$$p_F(c \mid k, l) = p(F_l(c) \mid k)\, |J(F_l)|_c \qquad (2)$$

where $|J(F_l)|_c$ is the absolute value of the Jacobian determinant of $F_l$ at c.

We will begin our analysis by making the following assumptions: 1) The 
surface class and illuminant type at any given pixel are mutually independent 
random variables; 2) The surface class and illuminant type at any given pixel 
are independent of the surface classes and illuminant types at nearby pixels; 3) 
The color of any given pixel is independent of the color of nearby pixels, even
when they correspond to the same surface class and illuminant type; 4) Each
surface element is illuminated by just one illuminant. Assumption 1) is fairly well 
justified: the fact that a surface is under direct sunlight or in the shade should 
be independent of the surface type. It should be noticed, however, that in case 
of “rough” surfaces (e.g., foliage), self-shading will always be present even when 
the surface is under direct sunlight. Assumption 2) is not very realistic: nearby 
pixels are very likely to belong to the same surface class and illuminant type. 
This is indeed a general problem in computer vision, by no means specific to 
this particular application. We can therefore resort to standard approaches to 
deal with spatial coherence [16,15]. The “independent noise” assumption 3) is 
perhaps not fully realistic (nearby pixels of the same smooth surface under the 
same illuminant will have similar color), but, in our experience, it is a rather 
harmless, and computationally quite convenient, hypothesis. Assumption 4) is 
a very good approximation for outdoor scenes, where the only two illuminants 
(excluding inter-reflections) are direct sunlight and diffuse light (shade) [1]. 

With such assumptions in place, we may write the total log-likelihood of the
collection of color points in the image, C, given the set of L transformations
$F_l$, as

$$L_F(C) = \sum_{x} \log \sum_{l=1}^{L} \sum_{k=1}^{K} P_{KL}(k, l)\, p_F(c(x) \mid k, l) \qquad (3)$$

$$\phantom{L_F(C)} = \sum_{x} \log \sum_{l=1}^{L} \sum_{k=1}^{K} P_K(k) P_L(l)\, p(F_l(c(x)) \mid k)\, |J(F_l)|_{c(x)}$$

where $c(x)$ is the color at pixel x, and we factorized the joint prior
distribution of surface class and illuminant type
($P_{KL}(k, l) = P_K(k) P_L(l)$) according to Assumption 1. Note that the first
summation extends over all image pixels, and that L and K are the number of
possible illuminants and surface classes, which are supposed to be known in
advance. In our experiments with outdoor scenes, we always assumed that only
two illuminants (sunlight and diffuse light) were present, hence L = 2.

Our goal here is to estimate the L transformations $\{F_l\}$ from the image C,
knowing the conditional likelihoods $p(c \mid k)$ and the priors $P_K(k)$ and
$P_L(l)$. Once such transformations have been estimated, we may assign each
pixel x a surface class k and illuminant type l by

$$\{\hat{k}, \hat{l}\} = \arg\max_{k, l}\; P_K(k) P_L(l)\, p(F_l(c(x)) \mid k)\, |J(F_l)|_{c(x)} \qquad (4)$$

We will first present an ML strategy to determine $\{F_l\}$, which maximizes the
total image log-likelihood (3). In Section 2.3, we will show how existing priors
on the color transformations can be used in a MAP setting.



2.1 Estimating the Illuminant Parameters 

As mentioned in the Introduction, we will restrict our attention to diagonal 
transformations of the type $F_l(c) = D_l c$, where 
$D_l = \mathrm{diag}(d_{l,1}, d_{l,2}, d_{l,3})$. Note that in this case, 
$|J(F_l)|_{c(x)} = |d_{l,1} d_{l,2} d_{l,3}|$. To make the optimization problem 
more tractable, we will assume that $p(c \mid k) \sim N(\mu_k, \Sigma_k)$. While 
this Gaussian assumption may not be acceptable in general (especially for 
multimodal color distributions), it has been shown in the literature that 
mixtures of Gaussians can successfully model color distributions [5,19]. The 
extension of our optimization algorithm to the case of Gaussian mixtures is 
trivial. 

The optimal set of 3L color transformation coefficients $\{d_{l,m}\}$ can be found 
using Expectation Maximization (EM) [13]. EM is an iterative algorithm that 
re-estimates the model parameters in such a way that the total image 
log-likelihood, $L_F(C)$, is increased at each iteration. It is shown in Appendix A 
that each iteration comprises L independent estimations, one per illuminant 
type. For the l-th illuminant type, one needs to compute 

\[
\{d_{l,m}\} = \arg\max \left( u_l \sum_{m=1}^{3} \log |d_{l,m}| - 0.5\, d_l^T G_l d_l + H_l^T d_l \right) \tag{5}
\]



where $d_l = (d_{l,1}, d_{l,2}, d_{l,3})^T$. The scalar $u_l$, the 3x3 matrix $G_l$ and 
the 3x1 vector $H_l$ are defined in Appendix A, and are re-computed at each 
iteration. Our task now is to maximize the function in (5) over 
$(d_{l,1}, d_{l,2}, d_{l,3})$. Note that the partial derivatives with respect to 
$d_{l,m}$ of the function to be maximized can be computed explicitly. Setting 
such partial derivatives to zero yields the following system of quadratic 
equations for m = 1,2,3: 

\[
\sum_{s=1}^{3} G_{l,m,s}\, d_{l,s}\, d_{l,m} - H_{l,m}\, d_{l,m} - u_l = 0 \tag{6}
\]



While this system cannot be solved in closed form, we note that if two variables 
(say, $d_{l,1}$ and $d_{l,2}$) are kept fixed, then the partial derivative of (5) 
with respect to the third variable can be set to 0 by solving a simple quadratic 
equation. Hence, we can maximize (5) by using the Direction Set method [14], i.e. 
iterating function maximization over the three axes until some convergence 
criterion is reached. Note that this is a very fast maximization procedure, and 
that its complexity is independent of the number of pixels in the image. 
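
The axis-by-axis scheme can be sketched in a few lines. The sketch below is 
illustrative only and is not the author's implementation: the function names, 
the starting point and the stopping rule are assumptions, G is assumed to be 
symmetric with a positive diagonal, u is assumed to be positive, and the 
positive root of each quadratic is chosen on the assumption that the diagonal 
coefficients are positive scale factors. 

```python
import math


def objective(d, u, G, H):
    """Value of the function maximized in (5) for one illuminant type."""
    log_term = u * sum(math.log(abs(x)) for x in d)
    quad = sum(d[m] * G[m][s] * d[s] for m in range(3) for s in range(3))
    lin = sum(H[m] * d[m] for m in range(3))
    return log_term - 0.5 * quad + lin


def maximize_axis_by_axis(u, G, H, d0=(1.0, 1.0, 1.0), tol=1e-12, max_sweeps=500):
    """Maximize (5) one coefficient at a time (hypothetical helper).

    Assumes u > 0, G symmetric with G[m][m] > 0, and positive coefficients.
    """
    d = list(d0)
    for _ in range(max_sweeps):
        largest_change = 0.0
        for m in range(3):
            # With the other two coefficients fixed, setting the partial
            # derivative to zero gives  a*d^2 + b*d - u = 0  with
            #   a = G[m][m],  b = sum_{s != m} G[m][s]*d[s] - H[m].
            a = G[m][m]
            b = sum(G[m][s] * d[s] for s in range(3) if s != m) - H[m]
            new_value = (-b + math.sqrt(b * b + 4.0 * a * u)) / (2.0 * a)
            largest_change = max(largest_change, abs(new_value - d[m]))
            d[m] = new_value
        if largest_change < tol:
            break
    return d
```

Each sweep only touches the scalar u, the 3x3 matrix G and the 3x1 vector H, so 
its cost does not grow with the number of pixels, in line with the remark above. 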



2.2 Outliers 

Any classification algorithm should account for “unexpected” situations that 
were not considered during training, by recognizing outlier or “none of the above” 
points. A popular strategy treats outliers as an additional class, for which a prior 
probability and a least informative conditional likelihood are defined [20]. For 
example, if one knows that the observables can only occupy a bounded region 
in the measurement space (a realistic assumption in the case of color features), 
one may allocate a uniform outlier distribution over such a region. 

In our work, we defined an outlier surface class with an associated uniform 
conditional likelihood over the color cube $[0 : 255]^3$. Note that the outlier 
class contributes to the parameter estimation only indirectly, in the sense that 
equation (2) does not apply to it. In other words, color transformations do not 
change the conditional likelihood given the outlier class. 



2.3 Imposing Illuminant Priors 

A problem with the approach detailed above is that the algorithm, when 
presented with scenes that contain surfaces with very different colors from the 
ones used for training, may arbitrarily “invent” unusual illuminants in order to 
maximize the scene likelihood. Illuminant spectral distributions in outdoor 
scenes are rather constrained [4,1]; this observation should be exploited to 
reduce the risk of such catastrophic situations. Prior distributions on the 
parameters can indeed be plugged into the same EM machinery used for Maximum 
Likelihood by suitably modifying the function to be maximized at each iteration. 
More precisely, as discussed in [13], imposing a prior distribution on the 
parameter d translates into adding the term $\log p_D(\{D_l\})$ to the function 
$Q(\{D_l\}, \{D_l^0\})$ in Appendix A before the maximization step. Assuming that 
the different illuminants in the scene are statistically independent, we may 
write 

\[
\log p_D(\{D_l\}) = \sum_{l=1}^{L} \log p_d(d_l) \tag{7}
\]

where $d_l = (d_{l,1}, d_{l,2}, d_{l,3})^T$. We will assume that all L illuminant 
types have the same prior probability. 

One way to represent the illuminant priors could be to start from the CIE 
parametric curve, thus deriving a statistical model for the matrices Di. Another 
approach, which we used in this work, is to take a number of pictures of the 
same target under a large number of illumination conditions, and analyze the 
variability of the color transformation matrices. For example, in our experiments 
we took 39 pictures of the Macbeth color chart at random times during daylight 
over the course of one week. For each corresponding color square in each pair of 
images in our set, we computed the ratios of the r, g and b color components in 
the two images. The hope is that the ensemble of all such triplets can adequately 
model the distribution of diagonal color transformations. We built a Gaussian 
model for the prior distribution of d by computing the mean $\mu_d$ and covariance 
$\Sigma_d$ of the collected color ratio triplets. These Gaussian priors can be 
injected into our algorithm by modifying (5) into 

\[
\{d_{l,m}\} = \arg\max \left( u_l \sum_{m=1}^{3} \log |d_{l,m}|
  - 0.5\, d_l^T (G_l + \Sigma_d^{-1})\, d_l
  + (H_l^T + \mu_d^T \Sigma_d^{-1})\, d_l \right) \tag{8}
\]

Fortunately, even with this new formulation of the optimization criterion, we 
can still use the Direction Set algorithm for maximization, as in Section 2.1. 
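
The step from (5) to (8) can be checked by expanding the logarithm of the 
Gaussian prior. The short derivation below is a sketch under the assumptions 
stated in the text (a Gaussian prior on each $d_l$ with mean $\mu_d$ and a 
symmetric covariance $\Sigma_d$); the symbols $c_0$ and $c_0'$, which collect 
terms that do not depend on $d_l$, are introduced here only for illustration. 

```latex
\begin{aligned}
\log p_d(d_l)
  &= -\tfrac{1}{2}\,(d_l - \mu_d)^T \Sigma_d^{-1} (d_l - \mu_d) + c_0 \\
  &= -\tfrac{1}{2}\, d_l^T \Sigma_d^{-1} d_l + \mu_d^T \Sigma_d^{-1} d_l + c_0' ,
  \qquad c_0' = c_0 - \tfrac{1}{2}\,\mu_d^T \Sigma_d^{-1} \mu_d .
\end{aligned}
```

Adding this expression to the function maximized in (5) leaves the logarithmic 
term unchanged, adds $\Sigma_d^{-1}$ to $G_l$ in the quadratic term and adds 
$\mu_d^T \Sigma_d^{-1}$ to $H_l^T$ in the linear term, which is the form of (8). 
The constant $c_0'$ does not affect the maximizer. 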



3 Experiments 

3.1 Macbeth Color Chart Experiments 

In order to provide a quantitative evaluation of our algorithm’s performance, 
we first experimented using mosaics formed by color squares extracted from 

Fig. 1. (a) The color squares used for training (center row) and for testing (top and 
bottom row) in the Macbeth color chart experiment. (b) The probability of correct 
match P{C) as a function of the total number of distractors for the Macbeth color 
chart experiment without illuminant compensation (solid line), with ML illuminant 
compensation (dotted line), with MAP illuminant compensation without the outlier 
class (dashed line), and with MAP illuminant compensation using the outlier class 
(dashed-dotted line). 



pictures of the GretagMacbeth™ chart under different illuminations, taken by a 
Sony DSC-S75 camera¹. We picked 5 colors from the chart (No. 2, 14, 3, 22, 11) 
as representative of 5 classes of interest. In Figure 1(a) we show the five colors 
as seen under evening light (after sunset, top row), direct afternoon sunlight 
(center row), and diffuse (shade) afternoon light (bottom row). We ran a number 
of experiments by training the system over the color squares in the middle row 
of Figure 1(a), and testing on a mosaic composed of color squares in the 
top and bottom rows, as well as of other colors in the chart (“distractors”). 
More precisely, for each test we formed two samples, one from the top row and 
one from the bottom row of Figure 1(a). Each sample had a number (randomly 
chosen between 0 and 5) of color squares, randomly chosen from those in the 
corresponding row, making sure that at least one of the two samples was not 
empty. Then, we augmented each sample with a number of randomly selected 
distractors that were not present in the training set. The test image is the union 
of the two samples. We ran 100 tests for each choice of the number of distractors 
per sample (which varied from 0 to 3). At each test, we first tried to assign each 
non-distractor color in the test image to one of the colors in the training set using 
Euclidean distance in color space. The ratio between the cumulative number of 
correct matches and the cumulative number of non-distractor squares in the 
test images over all 100 tests provides an indication of the probability of correct 
match without illuminant compensation. Such value is shown in Figure 1(b) by 
the solid line. Obviously, these results do not depend on the number of distractors. In 

¹ In order to roughly compensate for the camera’s gamma correction, we squared each 
color component before processing. 

Fig. 2. Color classification experiments: (a) training image; (b) test image; (c) 
classification with illuminant compensation (blue indicates outliers); (d) illuminant- 
compensated version of the test image; (e) classification without illuminant compensa- 
tion; (f) estimated illuminant distribution (black indicates outliers). 



order to test our illumination compensation algorithm, we added some “virtual 
noise” to the colors within each square, by imposing a diagonal covariance matrix 
with marginal variances equal to 10® (the reader is reminded that the color 
values were squared to reduce the effect of camera gamma correction). This 
artifice is necessary in this case because the color distribution in the squares 
had extremely small variance, which would create numerical problems in the 
implementation of the EM iterations. We didn’t need to add noise in the real- 
world tests of Section 3.2. The number L of illuminants in the algorithm was set 
to 2. The probabilities of correct match after illuminant compensation using the 
ML algorithm of Section 2.1 (without exploiting illuminant priors and without 
the outlier class), the MAP algorithm of Section 2.3 (without the outlier class), 
and the MAP algorithm using the outlier class are shown in Figure 1(b) 
by the dotted line, dashed line, and dashed-dotted line, respectively. Note that the 
distractors contribute to the determination of the illuminant parameters, and 
therefore affect the performance of the illuminant compensation system, as seen 
in Figure 1(b). Distractors have a dramatic negative effect if the illuminant 
priors are not taken into consideration, since the system will “invent” unlikely 
illumination parameters. However, the MAP algorithm is much less sensitive to 
distractors; if, in addition, the outlier class is used in the optimization, we see 



Fig. 3. Color classification experiments: (a): training image; (b) test image; (c) classifi- 
cation with illuminant compensation (blue indicates outliers); (d) estimated illuminant 
distribution (black indicates outliers); (e) illuminant-compensated version of the test 
image; (f) classification without illuminant compensation; (g) classification without 
illuminant compensation (without outlier class). 



that illuminant compensation allows one to increase the correct match rate by 
5-13%. 



3.2 Experiments with Real-World Scenes 

We tested our algorithm on a number of outdoor scenes, with consistently good 
results. We present two experiments of illuminant-compensated color 
classification in Figures 2 and 3. Figure 2 (a) and (b) show two images of the same 
scene under very different illuminants. Three color classes were trained over the 
rectangular areas shown in the image of Figure 2 (a), with one Gaussian mode 
per class. The results of classification after illuminant compensation are shown 
in Figure 2 (c). Pixels colored in blue are considered outliers. Note that for this 
set of images, we extended our definition of outlier to overexposed pixels (i.e., 











Learning Outdoor Color Classification from Just One Training Image 411 



pixels that have one or more color components equal to 255), which typically 
correspond to the visible portion of the sky. Figure 2 (d) shows the illuminant- 
compensated image. It is seen that normalization yields colors that are very 
similar to those in the training image. The assignment of illuminant types is 
shown in Figure 2 (f). Figure 2 (e) shows the result of classification without illu- 
minant compensation. In this case, large image areas have been assigned to the 
outlier class, while other areas have been misclassified. Comparing these results 
with those of Figure 2 (c) shows the performance improvement enabled by our 
illuminant compensation algorithm. 

Our second experiment is described in Figure 3. Figure 3 (a) shows the train- 
ing image (three classes were trained using the pixels within the marked rect- 
angles), while Figure 3 (b) shows the test image. The results of classification 
after illuminant estimation are shown in Figure 3 (c). Pixels colored in blue are 
considered outliers. The assignment of illuminant types is shown in Figure 3 (d), 
while Figure 3 (e) shows the illuminant-compensated image; note how illumi- 
nant compensation “casts light” over the shaded areas. The classifier without 
illuminant compensation (Figure 3 (f)) finds several outlier pixels in the shadow 
area. If forced to make a choice without the outlier option (Figure 3 (g)), it 
misclassifies the pixels corresponding to stones in the pathway. 



Appendix A 

In this Appendix, we show how the total likelihood $L_F(C)$ can be maximized 
over the parameters $\{d_{l,m}\}$ using Expectation Maximization (EM). Using the 
diagonal illumination model, we can rewrite $L_F(C)$ as 

\[
L_D(C) = \sum_x \log \sum_{l=1}^{L} \sum_{k=1}^{K} P_K(k)\, P_L(l)\, p(D_l c(x) \mid k)\, |\det D_l| \tag{9}
\]

As customary with the EM procedure, we first introduce the “hidden” variables 
$z_{k,l}(x)$ which represent the (unknown) label assignments: $z_{k,l}(x) = 1$ if the 
pixel x is assigned to the illuminant type l and surface class k; $z_{k,l}(x) = 0$ 
otherwise. We will denote the set of $z_{k,l}(x)$ over the image by Z. 

The EM algorithm starts from arbitrary values for the diagonal matrices 
$\{D_l^0\}$, and iterates over the following two steps: 

- Compute 

\[
\begin{aligned}
Q(\{D_l\}, \{D_l^0\}) &= E_{\{D_l^0\}} \left[ \log p_{\{D_l\}}(C, Z) \mid C \right] \\
 &= E_{\{D_l^0\}} \left[ \log p_{\{D_l\}}(C \mid Z) \mid C \right]
  + E_{\{D_l^0\}} \left[ \log P_{\{D_l\}}(Z) \mid C \right]
\end{aligned}
\tag{10}
\]

where $E_{\{D_l^0\}}[\,\cdot \mid C]$ represents expectation over $p_{\{D_l^0\}}(Z \mid C)$. 

- Replace $\{D_l^0\}$ with $\arg\max_{\{D_l\}} Q(\{D_l\}, \{D_l^0\})$. 

Given the assumed independence of label assignments, and the particular 
form chosen for variables $z_{k,l}(x)$, we can write 

\[
\log p_{\{D_l\}}(C \mid Z) = \sum_x \sum_{k=1}^{K} \sum_{l=1}^{L} z_{k,l}(x)\, \log p(D_l c(x) \mid k) \tag{11}
\]

where, given that the conditional likelihoods are assumed to be Gaussian: 

\[
\log p(D_l c(x) \mid k) = -1.5 \log(2\pi) - 0.5 \log |\det \Sigma_k|
  - 0.5\, (D_l c(x) - \mu_k)^T \Sigma_k^{-1} (D_l c(x) - \mu_k) \tag{12}
\]

and 

\[
\log P_{\{D_l\}}(Z) = \sum_x \sum_{k=1}^{K} \sum_{l=1}^{L} z_{k,l}(x) \left( \log P_K(k) + \log P_L(l) \right) \tag{13}
\]

It is easy to see that 

\[
E_{\{D_l^0\}} \left[ \log p_{\{D_l\}}(C \mid Z) \mid C \right]
  = \sum_x \sum_{k=1}^{K} \sum_{l=1}^{L} \log p(D_l c(x) \mid k)\, P_{\{D_l^0\}}(k, l \mid c(x)) \tag{14}
\]

and 

\[
E_{\{D_l^0\}} \left[ \log P_{\{D_l\}}(Z) \mid C \right]
  = \sum_x \sum_{k=1}^{K} \sum_{l=1}^{L} \left( \log P_K(k) + \log P_L(l) \right) P_{\{D_l^0\}}(k, l \mid c(x)) \tag{15}
\]

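where $P_{\{D_l^0\}}(k, l \mid c(x)) = E_{\{D_l^0\}}[z_{k,l}(x)]$ is the posterior 
probability of surface class k and illumination type l given the observation 
$c(x)$ and under color transformation matrices $\{D_l^0\}$. Using Bayes’ rule, we 
can compute $P_{\{D_l^0\}}(k, l \mid c(x))$ as 

\[
P_{\{D_l^0\}}(k, l \mid c(x)) =
\frac{P_K(k)\, P_L(l)\, p(D_l^0 c(x) \mid k)\, |\det D_l^0|}
     {\sum_{\bar{l}=1}^{L} \sum_{\bar{k}=1}^{K} P_K(\bar{k})\, P_L(\bar{l})\, p(D_{\bar{l}}^0 c(x) \mid \bar{k})\, |\det D_{\bar{l}}^0|}
\tag{16}
\]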

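Remembering that $|D_l| = |d_{l,1} d_{l,2} d_{l,3}|$, one sees that 

\[
\max_{\{d_{l,m}\}} Q(\{D_l\}, \{D_l^0\})
 = \max_{\{d_{l,m}\}} \sum_{l=1}^{L} \left( u_l \sum_{m=1}^{3} \log |d_{l,m}|
   - 0.5\, d_l^T G_l d_l + H_l^T d_l \right) \tag{17}
\]

where $d_l = (d_{l,1}, d_{l,2}, d_{l,3})^T$, and 

\[
u_l = \sum_x \sum_{k=1}^{K} P_{\{D_l^0\}}(k, l \mid c(x)) \tag{18}
\]

\[
G_{l,m,s} = \sum_x \sum_{k=1}^{K} S_{k,m,s}\, c_m(x)\, c_s(x)\, P_{\{D_l^0\}}(k, l \mid c(x)) \tag{19}
\]

\[
H_{l,m} = \sum_x \sum_{k=1}^{K} \sum_{s=1}^{3} S_{k,m,s}\, c_m(x)\, \mu_{k,s}\, P_{\{D_l^0\}}(k, l \mid c(x)) \tag{20}
\]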

where $S_k = \Sigma_k^{-1}$. Note that the terms in the summation over l in (17) 
can be maximized independently, meaning that the L sets of diagonal 
transformations can be computed independently at each iteration. 
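
The statistics (18)-(20) can be accumulated in a single pass over the pixels. 
The sketch below is a hypothetical pure-Python illustration, not the author's 
code: the function name and the data layout are assumptions, each surface class 
is passed as a triple (mean, inverse covariance $S_k$, log-determinant of the 
covariance), and the linear term is accumulated as 
$c_m(x) S_{k,m,s} \mu_{k,s}$, which is what expanding the Gaussian exponent in 
(12) gives. The outlier class is not handled, and there is no guard against 
numerical underflow of the likelihoods. 

```python
import math


def e_step_statistics(colors, classes, prior_k, prior_l, d0):
    """Accumulate u[l], G[l][m][s] and H[l][m] for the current estimates d0.

    colors  : list of (r, g, b) pixel colors
    classes : list of (mu, S, logdet_sigma), S being the inverse covariance
    prior_k : surface class priors P_K(k)
    prior_l : illuminant priors P_L(l)
    d0      : list of L current diagonal coefficient triplets
    """
    L, K = len(d0), len(classes)
    u = [0.0] * L
    G = [[[0.0] * 3 for _ in range(3)] for _ in range(L)]
    H = [[0.0] * 3 for _ in range(L)]
    for c in colors:
        # Unnormalized weights P_K(k) P_L(l) p(D_l c | k) |det D_l|, as in (16).
        w = [[0.0] * K for _ in range(L)]
        for l in range(L):
            abs_det = abs(d0[l][0] * d0[l][1] * d0[l][2])
            for k, (mu, S, logdet_sigma) in enumerate(classes):
                r = [d0[l][m] * c[m] - mu[m] for m in range(3)]
                maha = sum(r[m] * S[m][s] * r[s]
                           for m in range(3) for s in range(3))
                log_p = -1.5 * math.log(2.0 * math.pi) - 0.5 * logdet_sigma - 0.5 * maha
                w[l][k] = prior_k[k] * prior_l[l] * math.exp(log_p) * abs_det
        total = sum(sum(row) for row in w)
        for l in range(L):
            for k, (mu, S, _) in enumerate(classes):
                post = w[l][k] / total  # posterior of (k, l) given this pixel
                u[l] += post
                for m in range(3):
                    for s in range(3):
                        G[l][m][s] += S[m][s] * c[m] * c[s] * post
                        H[l][m] += S[m][s] * c[m] * mu[s] * post
    return u, G, H
```

The returned u, G and H for each illuminant type l are exactly the inputs 
needed by the per-illuminant maximization of Section 2.1. 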
Abstract. We address the problem of comparing attributed trees and 
propose a novel distance measure centered around the notion of a max- 
imal similarity common subtree. The proposed measure is general and 
defined on trees endowed with either symbolic or continuous-valued at- 
tributes, and can be equally applied to ordered and unordered, rooted 
and unrooted trees. We prove that our measure satisfies the metric con- 
straints and provide a polynomial-time algorithm to compute it. This 
is a remarkable and attractive property since the computation of tra- 
ditional edit-distance-based metrics is NP-complete, except for ordered 
structures. We experimentally validate the usefulness of our metric on 
shape matching tasks, and compare it with edit-distance measures. 



1 Introduction 

Graph-based representations have long been used with considerable success in 
computer vision and pattern recognition in the abstraction and recognition of ob- 
jects and scene structure. Concrete examples include the use of shock graphs to 
represent shape-skeletons [11,15], the use of trees to represent articulated objects 
[7] and the use of aspect graphs for 3D object representation [3]. The attractive 
feature of structural representations is that they concisely capture the relational 
arrangement of object primitives, in a manner which can be invariant to changes 
in object viewpoint. Using this framework we can transform a recognition prob- 
lem into a relational matching problem. The problem of how to measure the 
similarity or distance of pictorial information using graph abstractions has been 
a widely researched topic for over twenty years. 

The classic metric approach to graph comparison is edit-distance [4]. The 
idea behind this approach is that it is possible to identify a set of basic edit 
operations on nodes and edges of a structure, and to associate with these op- 
erations a cost. The edit-distance is found by searching for sequences of edit 
operations that will make the two graphs isomorphic with one another, and the 
distance between the two graphs is then defined to be the minimum over all the 
costs of these sequences. By making the evaluation of structural modification 
explicit, edit-distance provides a very effective way of measuring the similarity 
of relational structures. Moreover, the method has considerable potential for 
error-tolerant object recognition and indexing problems. Unfortunately, the task 
of calculating edit-distance is an NP-hard problem [24], hence, goal-directed ap- 
proximations are necessary to calculate it. The result is that the approximation 
almost invariably breaks the theoretical metric properties of the measure. 

Recently, a new and more principled approach to the definition of distance 
measure has emerged. In [2], Bunke and Shearer introduce a distance measure 
on unattributed graphs based on the maximum common subgraph and prove 
that it is a metric. Wallis et al. [20] introduce a variant of this distance based on 
the size of the minimum common supergraph. Finally, Fernandez and Valiente 
[5] define a metric based on the difference in size between maximum common 
subgraph and minimum common supergraph. More recently, in [6] Hidovic and 
Pelillo extend these metrics to the case of attributed graphs. Unfortunately all 
these metrics require the calculation of the maximum common subgraph, which 
is computationally equivalent to the calculation of edit-distance. 

In many computer vision and pattern recognition applications, such as shape 
recognition [13,15,17], the graphs at hand have a peculiar structure: they are 
connected and acyclic, i.e., they are trees, either rooted or unrooted, ordered or 
unordered, and frequently they are endowed with symbolic and/or continuous- 
valued attributes. Most metrics on trees found in the literature are defined in 
terms of edit-distance [18,21]. Zhang and Shasha [23] have investigated a special 
case of edit-distance which involves trees with an order relation among sibling 
nodes in a rooted tree. This special case constrains the solution to maintain the 
order of the children of a node. They showed that this constrained tree-matching 
problem is solvable in polynomial time and gave an algorithm to solve it. Re- 
cently, Sebastian, Klein and Kimia [13] use a similar algorithm to compare shock 
trees. Unfortunately, in the general case the problem has been proven to be NP- 
complete both for rooted [24] and unrooted trees [25]. Recently, Valiente [19] 
introduced a bottom-up distance measure between trees that is an extension to 
trees of the graph metric introduced by Bunke and Shearer [2], proving that the 
measure can be calculated in polynomial time on trees, but falling short of prov- 
ing that the measure is a metric. While this measure can be calculated efficiently 
both on ordered and unordered trees, it is limited to rooted and unattributed 
trees. 

Motivated by the work described in [6], in this paper we propose a normalized 
distance measure for trees equipped with either symbolic or continuous-valued 
attributes. We prove that the proposed measure fulfills the properties of a met- 
ric, and provide a polynomial-time algorithm to compute it. At an abstract level, 
our approach involves the computation of a maximum similarity common sub- 
tree. This allows us to define equivalent variations of the metric on ordered and 
unordered, rooted and unrooted, and attributed and unattributed trees. Since 
edit-distance on ordered trees can be computed in polynomial time, in the paper 
we focus on the unordered case where our approach provides a clear compu- 
tational advantage. To show the validity of the proposed measures, we present 
experiments on various shape matching tasks and compare our results with those 
obtained using edit-distance metrics. 

2 Preliminaries 

Let G = (V, E) be a graph, where V is the set of nodes (or vertices) and E is 
the set of undirected edges. Two nodes $u, v \in V$ are said to be adjacent 
(denoted $u \sim v$) if they are connected by an edge. A path is any sequence of 
distinct nodes $u_0 u_1 \ldots u_n$ such that for all $i = 1 \ldots n$, 
$u_{i-1} \sim u_i$; in this case, the length of the path is n. If $u_n \sim u_0$ the 
path is called a cycle. A graph is said to be connected if any two nodes are 
joined by a path. Given a subset of nodes $C \subseteq V$, the induced subgraph 
G[C] is the graph having C as its node set, and two nodes are adjacent in G[C] 
if and only if they are adjacent in G. With the notation |G| we shall refer to 
the cardinality of the node-set of graph G. 

A connected graph with no cycles is called an unrooted tree. A rooted (or 
hierarchical) tree is a tree with a special node that can be identified as the root. 
In what follows, when using the word “tree” without qualification, we shall refer 
to both the rooted and unrooted cases. Given two nodes $u, v \in V$ in a rooted 
tree, u is said to be an ancestor of v (and similarly v is said to be a descendant 
of u) if the path from the root node to u is a subpath of the path from the root 
to v. Furthermore, if $u \sim v$, u is said to be the parent of v and v is said to 
be a child of u. Both ancestor and descendant relations are order relations in V. 

Let $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$ be two trees. Any bijection 
$\phi : H_1 \to H_2$, with $H_1 \subseteq V_1$ and $H_2 \subseteq V_2$, is called a 
subtree isomorphism if it preserves both the adjacency relationships between the 
nodes and the connectedness of the matched subgraphs. Formally, this means that, 
given $u, v \in H_1$, we have $u \sim v$ if and only if $\phi(u) \sim \phi(v)$ and, 
in addition, the induced subgraphs $T_1[H_1]$ and $T_2[H_2]$ are connected. Two 
trees or rooted trees $T_1$ and $T_2$ are isomorphic, and we write 
$T_1 \cong T_2$, if there exists an isomorphism between them that maps every 
node in $T_1$ to a node in $T_2$ and covers every node in $T_2$. It is easy to 
verify that isomorphism is an equivalence relation. We shall use the notations 
$\mathrm{Dom}(\phi)$ and $\mathrm{Im}(\phi)$ to denote the domain and the image 
of $\phi$, respectively. 

Formally, an attributed tree is a triple $T = (V, E, \alpha)$, where (V, E) is the 
“underlying” tree and $\alpha$ is a function which assigns an attribute vector 
$\alpha(u)$ to each node $u \in V$. It is clear that in matching two attributed 
trees, our objective is to find an isomorphism which pairs nodes having “similar” 
attributes. To this end, let $\sigma$ be any similarity measure on the attribute 
space, i.e., any (symmetric) function which assigns a positive number to any pair 
of attribute vectors. If $\phi : H_1 \to H_2$ is a subtree isomorphism between two 
attributed trees $T_1 = (V_1, E_1, \alpha_1)$ and $T_2 = (V_2, E_2, \alpha_2)$, the 
overall similarity between the induced subtrees $T_1[H_1]$ and $T_2[H_2]$ can be 
defined as follows: 

\[
W_\sigma(\phi) = \sum_{u \in H_1} \sigma(u, \phi(u)) \tag{1}
\]

where, for simplicity, we define 
$\sigma(u, \phi(u)) = \sigma(\alpha_1(u), \alpha_2(\phi(u)))$. The isomorphism 
$\phi$ is called a maximum similarity subtree isomorphism if $W_\sigma(\phi)$ is 
largest among all subtree isomorphisms between $T_1$ and $T_2$. For the rest of 
the paper we will omit the subscript $\sigma$ when the node-similarity used is 
clear from the context. Two isomorphic attributed trees 
$T_1 = (V_1, E_1, \alpha_1)$ and $T_2 = (V_2, E_2, \alpha_2)$, with isomorphism 
$\phi$, are said to be attribute-isomorphic if for all $u \in V_1$ we have 
$\alpha_1(u) = \alpha_2(\phi(u))$. In this case we shall write $T_1 \cong_a T_2$. 
Attribute-isomorphism is clearly an equivalence relation. 

Note that the problem of determining a maximum similarity subtree 
isomorphism is a direct extension of the standard problem of finding a maximum 
(cardinality) common subtree; in fact, the two problems are equivalent when the 
similarity $\sigma$ is degenerate, i.e., $\sigma(u, v) = 1$ for all pairs of nodes. 

Now, given a set S, a function $d : S \times S \to \mathbb{R}^+$ is a metric on S 
if the following properties hold for any $x, y, z \in S$. 

1. $d(x, y) \ge 0$ (non-negativity) 

2. $d(x, y) = 0 \Leftrightarrow x = y$ (identity and uniqueness) 

3. $d(x, y) = d(y, x)$ (symmetry) 

4. $d(x, y) + d(y, z) \ge d(x, z)$ (triangular inequality). 

Furthermore, if the function satisfies $d(x, y) \le 1$ it is said to be a 
normalized metric. 

If $d : S \times S \to \mathbb{R}^+$ is a normalized metric, then the similarity 
function derived from d, defined as $\sigma(x, y) = 1 - d(x, y)$, fulfills the 
identity, uniqueness and symmetry properties. Furthermore, it fulfills the 
following variant of the triangular inequality: 
$\sigma(x, y) + \sigma(y, z) - \sigma(x, z) \le 1$. In the rest of the paper, 
we shall assume that all similarity functions are indeed derived from normalized 
metrics. 
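
The variant of the triangular inequality stated above follows in one line from 
the triangular inequality for d. The derivation below is a sketch that uses only 
the definition of the derived similarity as one minus the distance: 

```latex
\begin{aligned}
\sigma(x,y) + \sigma(y,z) - \sigma(x,z)
  &= \bigl(1 - d(x,y)\bigr) + \bigl(1 - d(y,z)\bigr) - \bigl(1 - d(x,z)\bigr) \\
  &= 1 - \bigl(d(x,y) + d(y,z) - d(x,z)\bigr) \\
  &\le 1 ,
\end{aligned}
```

where the last step holds because the bracketed quantity is non-negative by the 
triangular inequality for d. Since d takes values between 0 and 1, the derived 
similarity also takes values between 0 and 1. 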

It is straightforward to show that, with this assumption, we have 

\[
T_1 \cong_a T_2 \;\Leftrightarrow\; |T_1| = |T_2| = W(\phi) \tag{2}
\]

where $\phi$ is a maximum similarity isomorphism between $T_1$ and $T_2$. 

3 Distance Metric 

In this section, we define our measure for comparing attributed trees and prove 
that it fulfills the metric properties. First, we prove a lemma that turns out to 
be instrumental in proving our results; then we introduce our measure and prove 
the metric properties. 

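Lemma 1. Let $T_1$, $T_2$ and $T_3$ be three trees, and $\phi_{12}$, $\phi_{23}$, 
$\phi_{13}$ be maximum similarity subtree isomorphisms between $T_1$ and $T_2$, 
$T_2$ and $T_3$, and $T_1$ and $T_3$, respectively. Then, we have: 
$|T_2| \ge W(\phi_{12}) + W(\phi_{23}) - W(\phi_{13})$. 

Proof. Let $V_2^1 = \mathrm{Im}(\phi_{12}) \subseteq V_2$ and 
$V_2^3 = \mathrm{Dom}(\phi_{23}) \subseteq V_2$ be the sets of nodes in $V_2$ 
mapped by the isomorphisms $\phi_{12}$ and $\phi_{23}$, respectively. Furthermore, 
let $\hat{V}_2 = V_2^1 \cap V_2^3$ be the set of vertices in $V_2$ that are mapped 
by both isomorphisms. It is clear that the subtrees 
$\hat{T}_1 = T_1[\phi_{12}^{-1}(\hat{V}_2)]$ and 
$\hat{T}_3 = T_3[\phi_{23}(\hat{V}_2)]$ are isomorphic to each other, with 
isomorphism $\hat{\phi}_{13} = \phi_{23} \circ \phi_{12}$, where $\circ$ denotes the 
standard function composition operator, restricted to the nodes of $\hat{T}_1$. 
The similarity of this isomorphism is 

\[
W(\hat{\phi}_{13}) = \sum_{v \in \hat{V}_2} \sigma(\phi_{12}^{-1}(v), \phi_{23}(v)) .
\]

Since $\phi_{13}$ is a maximum similarity subtree isomorphism between $T_1$ and 
$T_3$, we have $W(\phi_{13}) \ge W(\hat{\phi}_{13})$. Hence 

\[
\begin{aligned}
W(\phi_{12}) + W(\phi_{23}) - W(\phi_{13})
 &\le W(\phi_{12}) + W(\phi_{23}) - W(\hat{\phi}_{13}) \\
 &= \sum_{v \in V_2^1} \sigma(\phi_{12}^{-1}(v), v)
  + \sum_{v \in V_2^3} \sigma(v, \phi_{23}(v))
  - \sum_{v \in \hat{V}_2} \sigma(\phi_{12}^{-1}(v), \phi_{23}(v)) \\
 &= \sum_{v \in V_2^1 \setminus \hat{V}_2} \sigma(\phi_{12}^{-1}(v), v)
  + \sum_{v \in V_2^3 \setminus \hat{V}_2} \sigma(v, \phi_{23}(v)) \\
 &\quad + \sum_{v \in \hat{V}_2} \left[ \sigma(\phi_{12}^{-1}(v), v) + \sigma(v, \phi_{23}(v))
  - \sigma(\phi_{12}^{-1}(v), \phi_{23}(v)) \right] \\
 &\le |V_2^1 \setminus \hat{V}_2| + |V_2^3 \setminus \hat{V}_2| + |V_2^1 \cap V_2^3|
  = |V_2^1 \cup V_2^3| \le |T_2| ,
\end{aligned}
\]

where the inequality follows from the triangular inequality for metric-derived 
similarities. □ 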

Let $\mathcal{T}$ be the quotient set of trees modulo attribute-isomorphism, that 
is the set of trees on which two trees are considered the same if they are 
attribute-isomorphic.¹ For any $T_1, T_2 \in \mathcal{T}$ we define the following 
distance function 

\[
d(T_1, T_2) = 1 - \frac{W(\phi_{12})}{\max(|T_1|, |T_2|)} . \tag{3}
\]



Theorem 1. d is a normalized metric in $\mathcal{T}$. 
Proof. 

1. $d(T_1, T_2) \ge 0$ 

We have $0 \le W(\phi_{12}) \le \max(|T_1|, |T_2|)$. Hence, 
$0 \le d(T_1, T_2) = 1 - \frac{W(\phi_{12})}{\max(|T_1|, |T_2|)} \le 1$. 

2. $d(T_1, T_2) = 0 \Leftrightarrow T_1 \cong_a T_2$ 

Let us consider the direction of implication $\Leftarrow$ (identity). From (2), we 
have $T_1 \cong_a T_2 \Rightarrow |T_1| = |T_2| = W(\phi_{12})$. Hence 
$d(T_1, T_2) = 1 - \frac{W(\phi_{12})}{\max(|T_1|, |T_2|)} = 0$. 

For the reverse implication (uniqueness), we have 
$d(T_1, T_2) = 0 \Rightarrow W(\phi_{12}) = \max(|T_1|, |T_2|)$. Since 
$W(\phi_{12}) \le \min(|T_1|, |T_2|) \le \max(|T_1|, |T_2|)$, we have 
$W(\phi_{12}) = \min(|T_1|, |T_2|) = \max(|T_1|, |T_2|)$. Hence, (2) yields 
$T_1 \cong_a T_2$. 

3. $d(T_1, T_2) = d(T_2, T_1)$ 

This follows directly from the symmetry of the maximum similarity graph 
and of the function max. 

4. $d(T_1, T_2) + d(T_2, T_3) \ge d(T_1, T_3)$ 

The triangular inequality can be simplified to the inequality 

\[
\begin{aligned}
&\max(|T_1|, |T_2|)\, \max(|T_2|, |T_3|)\, \max(|T_1|, |T_3|) \ge \\
&\quad W(\phi_{12})\, \max(|T_2|, |T_3|)\, \max(|T_1|, |T_3|)
 + W(\phi_{23})\, \max(|T_1|, |T_2|)\, \max(|T_1|, |T_3|) \\
&\quad - W(\phi_{13})\, \max(|T_1|, |T_2|)\, \max(|T_2|, |T_3|)
\end{aligned}
\tag{4}
\]

To prove this we need to separately analyze each of the six possible cases 

1. $|T_1| \ge |T_2| \ge |T_3|$   2. $|T_1| \ge |T_3| \ge |T_2|$   3. $|T_2| \ge |T_1| \ge |T_3|$ 
4. $|T_2| \ge |T_3| \ge |T_1|$   5. $|T_3| \ge |T_1| \ge |T_2|$   6. $|T_3| \ge |T_2| \ge |T_1|$. 

However, the roles of $T_1$ and $T_3$ in our proofs are symmetric, hence we can use 
this symmetry to reduce the analysis to three cases: (a) $|T_2| \ge |T_1| \ge |T_3|$, 
(b) $|T_1| \ge |T_2| \ge |T_3|$, and (c) $|T_1| \ge |T_3| \ge |T_2|$. 

a) $|T_2| \ge |T_1| \ge |T_3|$ 

The triangular inequality reduces to 
$|T_1||T_2| \ge W(\phi_{12})|T_1| + W(\phi_{23})|T_1| - W(\phi_{13})|T_2|$. 

\[
|T_1||T_2| \ge |T_1| \left( W(\phi_{12}) + W(\phi_{23}) - W(\phi_{13}) \right)
 \ge W(\phi_{12})|T_1| + W(\phi_{23})|T_1| - W(\phi_{13})|T_2|
\]

b) $|T_1| \ge |T_2| \ge |T_3|$ 

Equation (4) reduces to 
$|T_1||T_2| \ge W(\phi_{12})|T_2| + W(\phi_{23})|T_1| - W(\phi_{13})|T_2|$. 

\[
\begin{aligned}
|T_1||T_2| &= |T_2|(|T_1| - |T_2|) + |T_2|^2 \ge W(\phi_{23})(|T_1| - |T_2|) + |T_2|^2 \\
 &\ge W(\phi_{23})(|T_1| - |T_2|) + |T_2| \left( W(\phi_{12}) + W(\phi_{23}) - W(\phi_{13}) \right) \\
 &= W(\phi_{12})|T_2| + W(\phi_{23})|T_1| - W(\phi_{13})|T_2|
\end{aligned}
\]

c) $|T_1| \ge |T_3| \ge |T_2|$ 

We need to prove 
$|T_1||T_3| \ge W(\phi_{12})|T_3| + W(\phi_{23})|T_1| - W(\phi_{13})|T_3|$. 

\[
\begin{aligned}
|T_1||T_3| &\ge |T_1||T_2| - |T_2||T_3| + |T_2||T_3| \ge W(\phi_{23})(|T_1| - |T_3|) + |T_3||T_2| \\
 &\ge W(\phi_{23})(|T_1| - |T_3|) + |T_3| \left( W(\phi_{12}) + W(\phi_{23}) - W(\phi_{13}) \right) \\
 &= W(\phi_{12})|T_3| + W(\phi_{23})|T_1| - |T_3| W(\phi_{13}) . \qquad \Box
\end{aligned}
\]

¹ The quotient set formalizes the intuitive idea that two attributed trees are 
indistinguishable when they are attribute-isomorphic. Furthermore, it is needed 
in order to fulfill the uniqueness property of a metric. 

4 Extracting the Maximum Similarity Common Subtree 

In this section we give a polynomial-time algorithm for finding a maximum 
similarity subtree. The algorithm is based on the subtree identification algorithm 
presented by Matula [9], extending it in two ways. First, it generalizes it to deal 
with attributed trees and, second, it extends it to solve the more general problem 
of extracting the maximum (similarity) subtree and not merely to verify whether 
one tree is a subtree of the other. We give an algorithm to solve the maximum 
similarity common subtree problem for rooted trees, and then we show how the 
same algorithm can be used for the unrooted tree case. 

Let $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$ be two rooted trees, and let 
$u \in V_1$ and $w \in V_2$. We say that a subtree isomorphism between $T_1$ and 
$T_2$ is anchored at nodes u and w, if the subtrees of $T_1$ and $T_2$ induced by 
the isomorphism are rooted at u and w, respectively. In this case, we shall write 
$\phi^{(u,w)}$ to refer to any isomorphism anchored at u and w. Clearly, if $\phi$ 
is a maximum similarity subtree isomorphism, we have 

\[
W(\phi) = \max_{(u,w) \in V_1 \times V_2} \; \max_{\phi^{(u,w)}} W(\phi^{(u,w)}) .
\]

To determine the maximum similarity subtree isomorphism anchored at nodes
$u$ and $w$ we adopt a divide-and-conquer approach. Let $u_1, \ldots, u_n$ be the children
of node $u$ in $T_1$, and $w_1, \ldots, w_m$ the children of node $w$ in $T_2$. Without loss of
generality, we can assume $n \leq m$. Moreover, let us assume that we know, for
each $i = 1, \ldots, n$ and $j = 1, \ldots, m$, a maximum similarity subtree isomorphism
$\phi^{(u_i,w_j)}$ anchored at $u_i$ and $w_j$. Let $W_{ij}$ be the similarity of $\phi^{(u_i,w_j)}$; then the
computation of a maximum similarity subtree isomorphism anchored at $u$ and
$w$ can be reduced to an assignment problem on the children of $u$ and $w$, i.e.,

$$W^{(u,w)} = \sigma(u,w) + \max_{\pi \in \Sigma_n^m} \sum_{i=1}^{n} W_{i\pi(i)}, \qquad (5)$$

where $W^{(u,w)}$ is the similarity of a maximum similarity isomorphism anchored at
$u$ and $w$, and $\Sigma_n^m$ is the space of all possible assignments between a set of
cardinality $n$ and one of cardinality $m$. As a consequence, if $\pi$ is the optimal
assignment, the function defined as

$$\phi^{(u,w)}(x) = \begin{cases} w & \text{if } x = u \\ \phi^{(u_i,w_{\pi(i)})}(x) & \text{if } x \in \mathrm{Dom}(\phi^{(u_i,w_{\pi(i)})}) \end{cases} \qquad (6)$$

turns out to be a maximum similarity subtree isomorphism anchored at $u$ and $w$.

Figure 1 shows the resulting algorithm for determining a maximum similarity
subtree isomorphism of two rooted attributed trees. Since in the rest of the paper
we only need the maximum similarity induced by an isomorphism, and not the
isomorphism itself, for simplicity the main procedure Similarity accepts as
input a pair of attributed rooted trees and returns only the similarity value. It
makes use of a recursive procedure AnchoredSimilarity that accepts as input
two vertices, one from $T_1$ and the other from $T_2$, and returns the similarity of the
maximum isomorphism anchored at the input vertices, according to (5). To this
end, it needs a procedure for solving an assignment (or, equivalently, a bipartite
matching) problem, for which the algorithms literature offers many solutions
(see, e.g., [1]). The calculation of the maximum similarity common subtree of two
trees with $N$ and $M$ nodes, respectively, is reduced to at most $NM$ weighted
assignment problems of dimension at most $b$, where $b$ is the maximum branching
factor of the two trees. The computational complexity of our algorithm heavily
depends on the actual implementation of the assignment procedure. A popular way
of solving it, and the one we actually employed, is the so-called Hungarian
algorithm, which has complexity $O(n^2 m)$, $n$ and $m$ being the number of children
of $u$ and $w$ as used in (5), with $n \leq m$. It is simple to show that, using the
Hungarian algorithm, our algorithm has overall complexity $O(bNM)$. Of course,
the algorithm can be sped up by using more sophisticated assignment procedures [1].

Similarity(T1, T2)
    maxsim = 0
    for each node u in T1
        sim = AnchoredSimilarity(u, root(T2))
        if sim > maxsim
            maxsim = sim
    for each node w in T2
        sim = AnchoredSimilarity(root(T1), w)
        if sim > maxsim
            maxsim = sim
    return maxsim

AnchoredSimilarity(u, w)
    Cu = children(u)
    Cw = children(w)
    for each ui in Cu
        for each wj in Cw
            Wij = AnchoredSimilarity(ui, wj)
    return σ(u, w) + Assign({Wij})

Fig. 1. A polynomial-time algorithm for computing the similarity between two trees.
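The two procedures of Figure 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Node` class, the scalar attribute, and the node similarity `sigma` are hypothetical choices made for the example, and the assignment step is solved by brute force over all injective assignments rather than by the Hungarian algorithm, so it is practical only for very small branching factors. No memoization of anchored similarities is attempted.

```python
from itertools import permutations


class Node:
    """Hypothetical attributed tree node: a scalar attribute plus children."""

    def __init__(self, attr, children=None):
        self.attr = attr
        self.children = children or []


def sigma(u, w):
    # Assumed node similarity in [0, 1]; the paper's sigma depends on the
    # attributes used in each experiment.
    return max(0.0, 1.0 - abs(u.attr - w.attr))


def assign(weights):
    """Maximum-weight assignment by brute force (stand-in for Hungarian)."""
    if not weights or not weights[0]:
        return 0.0
    n, m = len(weights), len(weights[0])
    if n > m:  # transpose so that rows are the smaller side (n <= m)
        weights = [list(col) for col in zip(*weights)]
        n, m = m, n
    return max(
        sum(weights[i][perm[i]] for i in range(n))
        for perm in permutations(range(m), n)
    )


def anchored_similarity(u, w):
    """Similarity of the best isomorphism anchored at u and w, as in (5)."""
    weights = [
        [anchored_similarity(ui, wj) for wj in w.children]
        for ui in u.children
    ]
    return sigma(u, w) + assign(weights)


def nodes(root):
    """Yield every node of the tree rooted at root."""
    yield root
    for child in root.children:
        yield from nodes(child)


def similarity(t1, t2):
    """Main procedure of Figure 1: anchor every node against the other root."""
    maxsim = 0.0
    for u in nodes(t1):
        maxsim = max(maxsim, anchored_similarity(u, t2))
    for w in nodes(t2):
        maxsim = max(maxsim, anchored_similarity(t1, w))
    return maxsim
```

For instance, with `t1 = Node(0.5, [Node(0.2), Node(0.9)])` and `t2 = Node(0.5, [Node(0.9), Node(0.2), Node(0.4)])`, the roots match exactly and each child of `t1` finds an identical child in `t2`, so the best anchored similarity is the sum of three perfect node matches.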

Finally, if we have two unrooted trees $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$,
we can still pick two nodes $r_1 \in V_1$ and $r_2 \in V_2$, and consider the trees
$T_1^{r_1} = (V_1, E_1)$ and $T_2^{r_2} = (V_2, E_2)$ rooted at $r_1$ and $r_2$, respectively. Note
that if $\phi$ is an isomorphism between $T_1^{r_1}$ and $T_2^{r_2}$ with similarity $W$, then it
is an isomorphism between $T_1$ and $T_2$ with the same similarity. This yields a
straightforward $O(bN^2M^2)$ algorithm for unrooted trees, which consists of
iteratively calling Similarity($T_1^u$, $T_2^w$) for all $u \in V_1$ and $w \in V_2$, and taking the
maximum. However, we do not actually need to try all possible pairs of roots:
by simply fixing the root in one tree and letting the root vary among all
possible vertices in the other tree, the algorithm is still guaranteed to achieve
the maximum similarity. This yields an $O(bN^2M)$ algorithm for unrooted trees.

5 Experimental Results 

We evaluated the new metric on three different tree-based shape representations.
The first is the shock tree representation used by Pelillo, Siddiqi and Zucker in
[11], which is based on the differential structure of the boundary of a 2D shape.
It is obtained by extracting the skeleton of the shape, determined as the set of
singularities (shocks) arising from the inward evolution of the shape boundary,
and then examining the differential behavior of the radius of the bitangent circle
to the object boundary, as the skeleton is traversed. This yields a classification
of local differential structure into four different classes [15]. The so-called
shock-classes distinguish between the cases where the local bitangent circle has
maximum, minimum, constant, or monotonic radius. The labeled shock-groups
are then abstracted using a rooted tree where two vertices are adjacent if the
corresponding shock-groups are adjacent in the skeleton, and the distance from
the root is related to the distance from the shape barycenter. Here, we used the
same attributes and node-distances employed in [11]. Each shock was attributed
with its coordinates, distance from the border, and propagation velocity and
direction. The distance between two nodes was defined as a convex combination
of the (normalized) Euclidean distances of length, distance to the border,
propagation speed, and curvature.

Fig. 2. Distance matrices from the first experiment. Left: Our metric. Right: Edit-distance.

We compared our distance metric with edit-distance. To approximate the 
edit-distance we used the relaxation labeling algorithm presented in [17] with 
the following costs: we defined the cost of matching node u to node w to be
equal to the distance between their attributes, and the cost of removing any
node to be equal to 1. Note that, with these costs, edit-distance is not normalized.

Our shape database contained 29 shapes from 8 different classes. Figure 2 
shows the distance matrices obtained using our metric and edit-distance. Here, 
lighter colors represent lower distances while darker colors represent higher dis- 
tances. As can be seen, the same block structure emerges in both matrices. 
Essentially, the most significant difference between the two metrics is the dark
bands clearly visible in the edit-distance matrix.

In order to assess the ability of the distances to preserve class structure,
we performed pairwise clustering. In particular, we used two pairwise clustering
algorithms: Shi and Malik’s Normalized Cut [14], and Pavan and Pelillo’s
Dominant Sets [10]. Figure 3 shows the clusters obtained with both algorithms,
displayed in order of extraction. While the performance of the clustering
algorithms on this shape recognition task varied significantly, the dependency on
the choice of the distance measure was less pronounced. Nonetheless, some
differences can be observed. In particular, we notice how Normalized Cut exhibits
a well-known tendency to over-segment the data. The clusters obtained with
the Dominant Sets approach are much better, with our metric providing results
almost identical to edit-distance.

Fig. 3. Clusters obtained with Normalized Cut and Dominant Sets in the first experiment.

As for the running times, on a 2.5 GHz Pentium 4 PC the maximum similarity
algorithm presented in Section 4 took around 8 seconds to compute our
metric, while the relaxation labeling algorithm took over 30 minutes to compute
edit-distance.

Our second set of experiments used a larger database of shapes abstracted 
again in terms of shock-trees. Here, however, we used a different set of attributes 
recently analyzed in [16], i.e., the proportion of the shape boundary generating 
the corresponding shock-group. The database consisted of 150 shapes divided 
into 10 classes of 15 shapes each, and presented a higher structural noise than the 
previous one. Here the node distance and the node-matching cost for edit-distance
were both defined as the absolute difference between the attributes, while the node
removal cost was the value of the attribute itself. With these edit costs,
edit-distance is a normalized metric.

Fig. 4. Distance matrices from the second experiment. Left: Our metric. Right: Edit-distance.

Figure 4 shows the distance matrices obtained using our metric and edit-distance.
Note that, as before, both matrices exhibit the same block structure.
We applied the same clustering algorithms used in the previous series of
experiments. In order to assess the quality of the groupings, we used two well-known
cluster-validation measures [8]. The first is the standard misclassification rate.
We assigned to each cluster the class that has most members in the cluster. The
members of the cluster that belong to a different class are considered misclassified.
The misclassification rate is the percentage of misclassified shapes over the
total number of shapes. To avoid the bias towards higher segmentation that this
measure exhibits, we also used a second validation measure, i.e., the Rand index.
We count the number of pairs of shapes that belong to the same class and that
are clustered together and the number of pairs of shapes belonging to different
classes that are in different clusters. The sum of these two figures divided by the
total number of pairs gives us the Rand index. Here, the higher the value, the
better the classification.
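The two validation measures just described can be written down directly. The sketch below follows the verbal definitions in the text; the function names and the list-based input format (one class label and one cluster label per shape) are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def misclassification_rate(classes, clusters):
    """Fraction of shapes whose class differs from their cluster's majority class."""
    members = {}
    for cls, clu in zip(classes, clusters):
        members.setdefault(clu, []).append(cls)
    misclassified = sum(
        len(labels) - Counter(labels).most_common(1)[0][1]
        for labels in members.values()
    )
    return misclassified / len(classes)


def rand_index(classes, clusters):
    """Fraction of pairs on which the class labels and the clustering agree."""
    pairs = list(combinations(range(len(classes)), 2))
    agreements = sum(
        (classes[i] == classes[j]) == (clusters[i] == clusters[j])
        for i, j in pairs
    )
    return agreements / len(pairs)
```

As a small usage example, take six shapes with classes `["a", "a", "a", "b", "b", "b"]` and clusters `[0, 0, 1, 1, 1, 1]`. One shape of class a falls in the cluster dominated by class b, so the misclassification rate is 1/6; of the 15 pairs, 4 are same-class pairs clustered together and 6 are different-class pairs kept apart, so the Rand index is 10/15.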

Table 1 summarizes the results obtained using Normalized Cut and Dominant 
Sets. Here the two metrics generate clusters with comparable validation measures 
regardless of the clustering algorithm used. 



Table 1. Validation measures of clusters obtained in the second experiment.

                 Misclassification rate             Rand index
                 Normalized Cut   Dominant Sets     Normalized Cut   Dominant Sets
Our metric       23.3%            21.3%             90.3%            90.8%
Edit-distance    22.7%            24.0%             90.4%            90.8%


The last set of experiments was performed on a tree representation of North- 
ern Lights [12]. As in the previous experiments, the representation used is derived 
from the morphological skeleton, but the choice of structural representation was 
different from the one adopted for shock-graphs, and the extracted trees tend to 
be larger. The database consisted of 1440 shapes. Using our metric we were able 
to extract the full distance matrix within a few hours, but it was infeasible to
compute edit-distance on the entire database. For this reason, in order to be able
to compare the results with edit-distance, we also performed experiments using 
a smaller database consisting of 50 shapes. The calculation of edit-distance, even 
on this reduced database, took a full weekend. 

Fig. 5. Distance matrices from the third experiment. Left: Our metric. Right: Edit-distance.

In this case, we did not have the ground truth for the class memberships,
so we needed a different cluster-validation measure. We opted for a standard
measure that favors compact and well-separated clusters: the Davies-Bouldin
index [8]. Let $\Delta_i$ be the average distance between elements in cluster $i$, and $d_{ij}$
the average distance between elements in cluster $i$ and elements in cluster $j$. The
Davies-Bouldin index is

$$DB = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} R_{ij}, \qquad (7)$$

where $c$ is the number of clusters and $R_{ij} = \frac{\Delta_i + \Delta_j}{d_{ij}}$ is the cluster separation
measure. Clearly, lower values correspond to better separated and more compact
clusters.
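A minimal sketch of this index, computed from a pairwise distance matrix, is given below. It assumes the standard form of the Davies-Bouldin index (the average over clusters of the worst ratio between summed within-cluster spreads and between-cluster distance); the function name, the nested-list distance matrix, and the convention that a singleton cluster has zero spread are choices made for the example. At least two clusters are required.

```python
def davies_bouldin(dist, labels):
    """Davies-Bouldin index from a symmetric distance matrix and cluster labels."""
    clusters = sorted(set(labels))
    members = {c: [i for i, l in enumerate(labels) if l == c] for c in clusters}

    def within(c):
        # Average distance between distinct elements of cluster c.
        idx = members[c]
        pairs = [(a, b) for k, a in enumerate(idx) for b in idx[k + 1:]]
        if not pairs:
            return 0.0  # assumed convention for singleton clusters
        return sum(dist[a][b] for a, b in pairs) / len(pairs)

    def between(c1, c2):
        # Average distance between elements of c1 and elements of c2.
        total = sum(dist[a][b] for a in members[c1] for b in members[c2])
        return total / (len(members[c1]) * len(members[c2]))

    spread = {c: within(c) for c in clusters}
    worst = [
        max((spread[ci] + spread[cj]) / between(ci, cj)
            for cj in clusters if cj != ci)
        for ci in clusters
    ]
    return sum(worst) / len(clusters)
```

For two clusters of two points each, with within-cluster distance 1 and between-cluster distance 10, each ratio is (1 + 1) / 10, so the index is 0.2; pulling the clusters closer together raises it.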

Table 2 provides the values of the Davies-Bouldin index on the clusters ex- 
tracted using Normalized Cut and the Dominant Sets algorithm. As was the case 
with the previous experiments, both metrics produced comparable results. 



Table 2. Davies-Bouldin index of clusters obtained in the third experiment.

                 Normalized Cut   Dominant Sets
Our metric       0.0486           0.0723
Edit-distance    0.0232           0.0635



6 Conclusions 

In this paper we have presented a novel distance measure for attributed trees 
based on the notion of a maximum similarity subtree isomorphism, and provided 
a polynomial-time algorithm to calculate it. We have proven that this measure 
satisfies the metric properties and have experimentally validated its usefulness by 
comparing it with edit-distance on three different shape recognition tasks. Our 
experimental results show that, in terms of quality, the proposed metric compares 
well with edit-distance, its computation being, however, orders of magnitude 
faster.

Abstract. We present a probabilistic algorithm for finding correspondences
across multiple images. The algorithm runs in a distributed setting,
where each camera is attached to a separate computing unit, and
the cameras communicate over a network. No central computer is involved
in the computation. The algorithm runs with low computational
and communication cost. Our distributed algorithm assumes access to a
standard pairwise wide-baseline stereo matching algorithm (WBS) and
our goal is to minimize the number of images transmitted over the network,
as well as the number of times the WBS is computed. We employ
the theory of random graphs to provide an efficient probabilistic algorithm
that performs WBS on a small number of image pairs, followed by
a correspondence propagation phase. The heart of the paper is a theoretical
analysis of the number of times WBS must be performed to ensure
that an overwhelming portion of the correspondence information is extracted.
The analysis is extended to show how to combat computer and
communication failures, which are expected to occur in such settings,
as well as correspondence misses. This analysis yields an efficient distributed
algorithm, but it can also be used to improve the performance
of centralized algorithms for correspondence.



1 Introduction 

Settings with large numbers of cameras are spreading in many applications of
computer vision, such as surveillance, tracking, smart environments, etc. [11,7,16,5].
Existing vision applications in a multi-camera setting are based on a
central computer that gathers the information from all cameras, and performs 
the necessary computations. In some cases, part of the computation is performed 
locally at the cameras’ sites (e.g., feature detection or local tracking), and then 
the overall solution is computed by the central computer. 

Controlling a large application involving many cameras by a central server 
has the advantage that the computation, once performed, is reliable and can 
utilize all of the information in one place. But it has disadvantages that often
outweigh the advantages. First, since many vision applications require a significant
amount of computation, centralized solutions are often not scalable: their
performance degrades as the number of sites grows. In addition, the server can 
become a communication hot-spot and possible bottleneck. Finally, the central 
server is a single point of failure. If it fails or is unreachable for a while, the 
applications it governs may fail. Moreover, the possibility of temporary failures 
grows when the system is dynamic and, for example, cameras occasionally join 
or leave the system, or move from one place to another. These disadvantages 
of the centralized approach motivate an investigation of techniques for solving 
computer vision applications that are not based on a central server. The pro- 
cessing units at different cameras communicate among themselves and perform 
whatever computations may be needed in the application. The scenario in which 
many of the cameras are attached to reasonably powerful computing devices is 
quite realistic, and supports this approach. 

In this paper we present a distributed approach to computing multi-image
correspondence in a multi-camera setting. Such correspondence forms the basis
of many important visual tasks, such as calibration, 3D scene reconstruction,
and tracking. One way to compute multi-image correspondence is by computing
correspondence between pairs of images, using a Wide-baseline Stereo (WBS)
algorithm.^1 Computing WBS for all pairs, which clearly guarantees obtaining
full correspondence, is costly in terms of both communication and computation.
Moreover, the computation becomes intractable when a large setting with
hundreds or even thousands of cameras is considered. An alternative is to perform
WBS computations on only some of the pairs, and then use the transitivity
of correspondence to obtain further correspondence information among images
that were not compared directly. A key aspect of such an algorithm is the choice
of image pairs to which the WBS algorithm will be applied.

Our solution is distributed: every camera is involved in a limited amount
of communication and performs only a small number of WBS computations.
Propagation of the correspondence is performed by local communication between
cameras. Nevertheless we are guaranteed that, with high probability, the full
correspondence information is obtained for the vast majority of points at the
end of the propagation process. Our solution can be tuned so that it will tolerate
communication failures, processor failures, and failure of the WBS computations
to identify corresponding points in overlapping images.

^1 We use the WBS computation as a black box; a better solution to WBS will improve
the performance of our scheme.

A key element in the efficiency of an algorithm such as ours is the choice of
which WBS computations to perform. We employ the theory of random graphs
in order to obtain a drastic reduction in the number of such computations each
camera performs. Further reduction is obtained when there is information regarding
cameras that do not view overlapping regions. Finally, we tune our algorithm
to capture correspondence information for points that are seen by many cameras.
In a multi-image setting, the number of images in which a feature point p
appears is important. We call this the degree of exposure, or exposure for short,
of p. Applications that use correspondence information typically obtain much
greater benefit from points with high exposure than from ones with low exposure
(e.g., Bundle Adjustment [23]). Accordingly, our algorithm is designed to
accept an exposure parameter k and will be tuned to find the correspondence
information of points with degree of exposure that is greater than or equal to k.
The algorithm has the following features:

1. Every camera i performs a small number $s_i$ of WBS computations;

2. cameras exchange correspondence information for at most $\log_2 k$ rounds; and

3. for every feature point p with exposure degree k or more, with probability at
least 0.99 all cameras that view p obtain the full correspondence information
regarding p.

As we show, when there is sufficient information about the relative locations
of cameras, $s_i$ may be $c \cdot \log_2 k$ for some constant c. Since WBS computations
are dominant in this application, the algorithm will then terminate in time that
is proportional to $\log_2 k$. Our approach is probabilistic, rather than heuristic.
Moreover, its success is guaranteed with high probability for every given set of
images (provided that the WBS algorithm is error-free).

The algorithm is designed in such a way that no single failure can impact 
the quality of the correspondence information obtained in a significant way. 
Moreover, it is robust in the sense that it degrades gracefully as the number 
of failures grows. We extend the algorithm to handle unreliable systems with 
communication failures, processor failures, and failure of the WBS computations 
to identify corresponding points in overlapping images. Roughly speaking, in 
order to overcome a failure rate of f < 1 of the communication channels (resp. a
portion f < 1 of the cameras crashing, or a portion f < 1 of the matches being
false-negative errors by the WBS), an increase of roughly in the number
of WBS computations leads to the same performance as in a system with no 
failures. Hence, to overcome a high failure rate of 10%, the cameras need to 
perform only 12% more work! 

While originally motivated by the quest for a distributed solution, our proba- 
bilistic analysis can be applied to reduce the number of WBS computations even 
when correspondence is computed on a single computer. That is, our algorithm 
can be simulated on a centralized computer (replacing the propagation step by a 
simple transitive closure computation) to improve the efficiency of computation 
of existing centralized algorithms. 

The question of how to reduce the number of WBS computations performed 
in the centralized setting has been addressed by Schaffalitzky & Zisserman [21]. 
They suggested a heuristic approach to this problem: first, single-view invariants
are computed and mapped to a large feature vs. views hash table. The hash
table can then guide the greedy choice of the pairs on which to compute WBS,
resulting in run-time complexity of O(n) WBS computations, where n is the
number of cameras.

This paper makes two main contributions. One is in providing a reasonably 
efficient solution to the multi-image correspondence problem in a distributed 
system with no central server and no single point of failure. The second is in
employing the theory of random graphs in order to reduce the number of WBS
computations needed to obtain a useful amount of correspondence information.



2 Previous Work 

There have recently been a number of significant advances on the subject of
Wide-baseline Stereo (WBS), in which computations involving a small number
of images are used to extract correspondence information among the images
[24,1,17,20,6,15].

Schaffalitzky & Zisserman [21] and then Ferrari et al. [10] suggested methods
for wide baseline matching among a large set of images (on the order of 10 to 
20 cameras). They both suggested methods for extending the correspondence 
of two (or three) views to n views, while using the larger number of views to 
improve the pairwise correspondence. Their algorithms are designed to run on a 
central computer. Levi & Werman [12] consider the problem of computing the 
fundamental matrices between all pairs of n cameras, based on knowing only the 
fundamental matrices of a subset of pairs of views. Their main contribution is 
an algebraic analysis of the constraints that can be extracted from a partial set 
of fundamental matrices among neighboring views. These constraints are then 
used to compute the missing fundamental matrices. 

In recent years various applications of multi-camera settings with a central
computer have been considered. These include tasks such as surveillance, smart
environments, tracking and virtual reality. Collins et al. [7,8] report on a large
surveillance project consisting of 14 cameras spread over a large compound. The 
algorithm they used for calibration, which was based on known 3D scene points 
[8], was performed on a central server. The virtual-reality technology introduced 
by Kanade et al. [16,19] uses a multi-camera setup that can capture a dynamic 
event and generate new views of the observed scene. Again, the cameras were 
calibrated off-line using a central computer. Smart environments [5,13] consist of 
a distributed set of cameras spread in the environment. The cameras can detect 
and track the inhabitants, thus supporting higher-level functions such as con- 
venient man-machine interfacing or object localization. Despite the distributed 
nature of these systems, the calibration of the cameras is usually done off-line 
on a central processor. 

Karuppiah et al. [11] have already discussed the value of solving multi-camera
computer vision problems in a distributed manner. They constructed
a four-camera system and experimented with tracking and recognition in this 
system, showing the potential for fault-tolerance and avoiding a single point of 
failure. 

While the literature contains little in the way of distributed solutions to com- 
puter vision applications, the literature on distributed systems and distributed 
computing has addressed many issues that are relevant to a task of this type. 
They involve methods for failure detection and fault-tolerant execution of com- 
putations, algorithms for leader election and consensus, etc. A good overview of 
the issues can be found in Tanenbaum and van Steen [22] and in the collection
by Mullender [18]. A comprehensive source for distributed algorithms is
Lynch [14], and a useful treatment of issues relating to data replication
is the book by Bernstein et al. [2].



3 The Algorithm 



We assume a set {1, . . . , n} of cameras overlooking a scene. Each camera has a 
processing unit attached to it, and the cameras can communicate over a point- 
to-point communication network. We further assume that the communication 
network is complete so that every camera can communicate directly and reliably 
with every other camera. We denote by $M_i$ the set of cameras with which
camera i can have corresponding points and let $m_i$ denote the size of $M_i$. Initially,
we assume that the WBS computations are noise and error free: a WBS computation
performed on a pair of images identifies two locations in the images as
being corresponding exactly if there is a genuine feature point p that appears 
in the stated coordinates in the respective images. We relax these reliability 
assumptions in Section 5. 

Our distributed algorithm is defined in terms of an exposure parameter k, 
and is designed to discover the vast majority of points with an exposure of
k or more. Each camera maintains a list of its own feature points and their
corresponding points in other cameras. At each propagation step, each camera
propagates any new correspondence information to all the cameras with which
it has established corresponding points. Each camera runs the following
algorithm, given the exposure parameter k.



1. Initialization 

Randomly choose a set $S \subseteq M_i$ of cameras of size



$$s_i = m_i \, \tau(k) \approx m_i \, \frac{\log k + b}{\cdots}, \qquad (1)$$



and request their images. 

2. Pairwise Matching

For every camera j from which an image has been received, perform a WBS
computation between i’s image and j’s, record its results in the local
correspondence lists, and send the results to j. Concurrently, for every request
for an image, send your image to the requesting camera and later, when you
receive the results, record them in the correspondence lists.

3. Correspondence Propagation

This stage proceeds in rounds of communication. In the first round, for every
point $p_i = (x_i, y_i)$ in camera i’s image that has been matched with more
than one point by the WBS computations, i performs a propagation step.
In every subsequent round, i performs propagation steps for every point $p_i$ for
which it received new correspondence information in the most recent round.
The propagation is terminated when no new correspondence information is
received.

One of the main contributions of the algorithm is in equation 1 that expresses
the number of computations as a function of the exposure parameter k.

As for $m_i$, it is equal to n in case no prior information is available, but in many
cases, we do initially have information regarding which images potentially have
corresponding points, and which do not. Reducing the value of $m_i$ means a
smaller number of WBS computations, as is evident from equation 1. Consider,
for example, a situation in which cameras are located around a hill or a rooftop.
They may cover the surrounding scene quite effectively, while every camera has
a limited number of relevant neighbors to consider for correspondence. There
is another source that may reduce the size of $M_i$. Some recent approaches to
computing multi-view correspondence contain a preprocessing stage in which
images are ranked for likelihood of correspondence (e.g., [21]). The result of
such a stage can reduce the sets $M_i$.
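The three steps of the algorithm can be simulated on a single machine, which may help in following the analysis in the next section. The sketch below is a simplified model under several assumptions that are not part of the paper: every camera may be paired with every other camera (so $m_i = n - 1$), the WBS computation is replaced by an error-free stand-in that intersects sets of feature-point identifiers, and the number s of sampled partners is passed in by the caller rather than derived from equation 1. The function name and data layout are hypothetical.

```python
import random


def simulate(views, s, seed=0):
    """Centralized simulation of initialization, matching, and propagation.

    views[i] is the set of feature-point ids seen by camera i. Returns
    (known, rounds), where known[i][p] is the set of cameras that camera i
    knows to view point p, and rounds counts the propagation rounds that
    delivered new information.
    """
    rng = random.Random(seed)
    n = len(views)
    known = [{p: {i} for p in views[i]} for i in range(n)]

    # Steps 1 and 2: each camera samples s partners and matches with them.
    # Both cameras of a pair record the result, as in the pairwise step.
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for j in rng.sample(others, min(s, len(others))):
            for p in views[i] & views[j]:
                known[i][p].add(j)
                known[j][p].add(i)

    # Step 3: the first round propagates points matched in more than one
    # other camera; later rounds propagate only points with new information.
    dirty = [{p for p, cams in known[i].items() if len(cams) > 2}
             for i in range(n)]
    rounds = 0
    while any(dirty):
        outgoing = [{p: set(known[i][p]) for p in dirty[i]} for i in range(n)]
        dirty = [set() for _ in range(n)]
        for i in range(n):
            for p, cams in outgoing[i].items():
                for j in cams - {i}:
                    if not cams <= known[j][p]:
                        known[j][p] |= cams
                        dirty[j].add(p)
        if any(dirty):
            rounds += 1
    return known, rounds
```

When propagation stops, any two cameras that know of each other for a point hold identical lists for that point, which is the transitive-closure behaviour the analysis relies on. Whether a point's list covers all cameras that view it depends on the random pairs chosen in the first step.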

4 Probabilistic Analysis of the Algorithm 

The probabilistic analysis will show that the above algorithm detects all
points with exposure factor greater than or equal to k with probability 99%.
Furthermore, it will show that at most $\log_2(n)$ propagation steps are needed for
the algorithm to terminate.

We represent the state of information that the cameras attain regarding the
multi-view correspondence of feature points by a labeled multi-graph, which we
denote by G. There is a node in G for each camera. There is a labeled edge,
$(\{i,j\}, p)$, between nodes i and j if p is an established corresponding point of
the images of i and j. Initially, the graph has no edges. After the first phase, in
which WBS computations are performed, the graph contains edges only among
images that were compared directly by a WBS computation. Additional edges
are added to G in the propagation phase.

Let us begin by considering the behavior of the algorithm in terms of dis- 
covering the correspondence information of a single 3D feature point p. Let us 
call the set of cameras that view the point p the p-set. All of the correspon- 
dence information regarding p will be uncovered exactly if, at the end of the 
propagation process, every pair of cameras in the p-set will share a p-edge. To 
analyze the algorithm’s behavior with respect to p it is convenient to consider 
the p-graph Gp derived from G that is defined by the p-set and the p-edges of G. 
More formally, $G_p = (V_p, E_p)$ where $V_p$ is the p-set, i.e., the set of cameras that
view p, and $E_p$ consists of the edges $\{i,j\}$ for which $(\{i,j\}, p)$ is in G. We refer to
the state of $G_p$ after the matching step of the algorithm by $G_p(0)$, and after $r \geq 1$
rounds of propagation by $G_p(r)$.

4.1 Analysis of Propagation 

We now prove that if Gp(0) is connected, then propagation will uncover the full 
correspondence information regarding p. Moreover, this will be done within a 
small number of rounds of propagation.

Lemma 1. If the distance between nodes i and j in $G_p(0)$ is d, where
$2^{r-1} < d \leq 2^r$, then their distance in $G_p(r)$ is 1.

Fig. 1. (a) $G_p(0)$: the distance between $v_1$ and $v_5$ is 4. (b) $G_p(1)$: the distance is
reduced to 2. (c) $G_p(2)$: the distance is 1.



Proof. Consider two vertices $v_1$ and $v_{d+1}$ that are distance d apart. Let
$v_1, \ldots, v_{d+1}$ be a path connecting these points in $G_p(0)$. In the first round of the
propagation algorithm node $v_2$ will update node $v_1$ that $v_3$ also views the point p,
and similarly node $v_2$ will also update node $v_3$ that $v_1$ views the point p (see
Figure 1). As a result, nodes $v_1$ and $v_3$ both update their local p-lists, and the edge
$\{v_1, v_3\}$ is added to $G_p$. In a similar manner, all edges between $v_i$ and $v_{i+2}$,
$i + 2 \leq d + 1$, are added to $G_p$. It follows that after a single round of propagation,
the path $v_1, v_3, v_5, \ldots, v_{d+1}$ connects $v_1$ and $v_{d+1}$ in the graph. As a result, the
distance between $v_1$ and $v_{d+1}$ is shortened by a factor of two, and it is $\lceil d/2 \rceil$. A
straightforward induction shows that the distance between $v_1$ and $v_{d+1}$ is reduced
to 1 after $\lceil \log_2(d) \rceil = r$ steps.

Corollary 1. Suppose that Gp(0) is connected. If the diameter of Gp(0) is d,
then $G_p(\lceil \log_2 d \rceil)$ is a complete graph (its diameter is 1).

Since $d < k \le n$ is guaranteed, Corollary 1 implies that there is no need to
ever run the propagation algorithm for more than $\lceil \log_2 n \rceil$ rounds.
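
The halving argument of Lemma 1 can be checked with a small simulation. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: it models a single p-graph as a path (the worst case for a given diameter), and the function names `propagate` and `rounds_until_complete` are hypothetical. In each round every camera tells each of its neighbours about all of its other neighbours, which is the update described in the proof.

```python
import math


def propagate(adj):
    """One propagation round on a p-graph given as {node: set(neighbours)}.

    Every node v informs each neighbour a about every other neighbour b,
    so the edge {a, b} is added for all pairs of neighbours of v.
    """
    new_adj = {v: set(nbrs) for v, nbrs in adj.items()}
    for nbrs in adj.values():
        for a in nbrs:
            for b in nbrs:
                if a != b:
                    new_adj[a].add(b)
    return new_adj


def rounds_until_complete(n_nodes):
    """Start from a path on n_nodes cameras (assumed worst case) and count
    the rounds needed until every pair of cameras shares an edge."""
    adj = {i: set() for i in range(n_nodes)}
    for i in range(n_nodes - 1):
        adj[i].add(i + 1)
        adj[i + 1].add(i)
    rounds = 0
    while any(len(nbrs) < n_nodes - 1 for nbrs in adj.values()):
        adj = propagate(adj)
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n_nodes in (2, 3, 5, 9, 17, 30):
        d = n_nodes - 1  # diameter of the initial path
        print(d, rounds_until_complete(n_nodes), math.ceil(math.log2(d)))
```

For a path of diameter d the loop stops after $\lceil \log_2 d \rceil$ rounds, in line with Corollary 1.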

Corollary 2. If camera i does not receive a new update regarding the point p in 
round r of the propagation phase, then i will never send or receive any further 
updates about p. 

Corollary 1 proves that propagation is guaranteed to terminate for all points 
within a small logarithmic number of rounds. Moreover, by Corollary 2 every 
camera can easily detect when its propagation phase is done. 

4.2 The Number of WBS Computations 

Fig. 2. The graph G with cameras that view the points $p_1$, $p_2$, and $p_3$. The graphs $G_{p_1}$,
$G_{p_2}$, and $G_{p_3}$ are marked with red, black, and green edges, respectively. The edges of G
span each of the derived graphs $G_{p_i}$.

As we have seen, if Gp(0) is connected then the propagation phase of the algorithm
will discover the correspondence information regarding p. Clearly, if Gp(0)
is not connected, then only partial information will be discovered. In this section
we determine the precise number of cameras to select in the initialization
step of the algorithm, by showing that this number ensures connectedness. We
base the discussion here on the theory of random graphs, which was initiated
in a paper by Erdős and Rényi [9]. Consider a random process in which each of
the $\binom{N}{2}$ undirected edges of a graph on N nodes is chosen independently with
probability p > 0. The resulting graph is denoted by $\mathcal{G}(N, p)$.
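
As an illustration of this random graph model, the sketch below estimates by Monte Carlo sampling how often a sampled graph is connected. It is a minimal Python sketch and not part of the original analysis; the helper names, the number of trials, and the seed are assumptions made for the example.

```python
import random


def is_connected(n, p, rng):
    """Sample a graph on n nodes with independent edge probability p and
    report whether it is connected (union-find over the sampled edges)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
                    components -= 1
    return components == 1


def estimate_connectivity(n, p, trials=200, seed=0):
    """Fraction of sampled graphs that are connected (hypothetical helper)."""
    rng = random.Random(seed)
    return sum(is_connected(n, p, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for p in (0.02, 0.1, 0.2, 0.5):
        print(p, estimate_connectivity(30, p))
```

Sweeping p for a fixed N shows the sharp transition from almost never connected to almost always connected that the lemma below quantifies.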

Lemma 2. (a) Let $p(N) = \frac{0.577 + \ln N}{N}$. The probability that $\mathcal{G}(N, p(N))$ is
connected tends to 1 as N tends to $\infty$. More concretely, for small values of N
we have [bounds missing in source].

(b) Let p(N) = [formula missing in source]. The probability that $\mathcal{G}(N, p(N))$ is
connected is greater than 0.99 for all values of N < 40,000.

The first part of the lemma is a classical result in the field, while the results
in the second part are from Bollobás and Thomason [4], as they are quoted in
the excellent textbook by Bollobás [3]. Since we are unlikely to be interested in
computing correspondence for points that are seen by more than 40,000 cameras, 
the second part gives us very good bounds to work with: For our purposes, if p has 
exposure degree k and each pair of nodes in the p-set is chosen with probability 
at least p{k) for a WBS computation, then we have high assurance (over 0.99 
probability!) that Gp(0) will be connected. 

For independent probabilistic events A and B, we have that $\Pr(A \cup B) =
\Pr(A) + \Pr(B) - \Pr(A)\Pr(B)$. An edge is chosen if one of its nodes selects it. In
the algorithm, if every node selects the edge with probability $\tau(k)$, then we need
to ensure that $2\tau(k) - \tau^2(k) \ge p(k)$ in order to guarantee that edges are chosen with
sufficient probability. The exact formula for $\tau(k)$ is thus $\tau(k) = 1 - \sqrt{1 - p(k)}$.
However, $\tau(k)$ tends to $p(k)/2$ in the limit, and for all k > 10 we have $\tau(k) <
0.6\,p(k)$. So $\tau(k)$ is essentially $p(k)/2$.
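
The relation between the per-node selection probability and the resulting edge probability can be verified numerically. The Python sketch below is only an illustration: the function name `tau` is hypothetical, and the values of p are arbitrary examples rather than values of p(k), whose formula is not reproduced here.

```python
import math


def tau(p):
    """Per-node selection probability such that an edge is selected by at
    least one of its two endpoints with probability exactly p."""
    return 1.0 - math.sqrt(1.0 - p)


if __name__ == "__main__":
    for p in (0.05, 0.1, 0.2, 0.4):
        t = tau(p)
        # 2t - t^2 is the probability that at least one endpoint selects the edge.
        print(p, t, 2 * t - t * t, t / p)
```

For small p the ratio `t / p` approaches 1/2 from above, which is why the per-node probability is roughly half of the required edge probability.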

Our analysis so far has been in terms of connectivity of the graph Gp(0).
Indeed, working with Gp rather than G is crucial since guaranteeing that G
is connected would not immediately yield Gp's connectivity. (Figure 2 is an
example showing that not every spanning graph of G induces spanning graphs
of all of the graphs $G_{p_i}$.) However, the correspondence algorithm works at the
level of the graph G on all cameras, with no a priori knowledge about the identity
of the p-sets. Working at this level, if every edge is chosen with probability at
least $\tau(k)$ then, in particular, every edge among members of the p-set is chosen
with this probability. The desired property for ensuring connectivity of Gp(0)
is thus satisfied. Actually, much more is true. Connectivity of Gq(0) is ensured
with high probability at once for all points q with exposure degree k or more!
In particular, Lemma 2(b) implies that in this case the algorithm will find the
correspondence information for at least 99% of these points.

In the algorithm, we guarantee that a camera i chooses each edge with probability
at least $\tau(k)$ by having it randomly choose a subset of $N_i$ of size $s_i$, where
$s_i / |N_i| \ge \tau(k)$. Choosing a subset of the neighbors of a predetermined size has the
advantage that we can control the number of WBS computations that every
node performs. In summary, we have

Theorem 1. Executing the correspondence algorithm with parameter k will,
with high probability, yield the full correspondence information for at least 99%
of the points that have exposure degree k or larger.

5 Dealing with System Failures 

In this section we consider the properties of our algorithm when executed on 
an unreliable distributed system. We start with a classical analysis of processor 
crashes and communication failures but show that the analysis can be naturally 
extended to handle mis-matches by the WBS as simply another type of failure. 



5.1 System Crashes 

Let us first consider crashes. Assume that some of the cameras may crash during 
the operation of the algorithm. We assume further that the cameras use a timeout 
mechanism to identify that a processor is down. Clearly, if i is in a p-set and 
it crashes early on, we do not expect to necessarily discover the correspondence 
information regarding i’s image. Define the surviving degree of a feature point p 
to be its exposure degree if we ignore the cameras that crash. Crashed cameras
do not participate in the algorithm, and their crashing does not affect the
interactions among the surviving cameras. The original algorithm, unchanged,
is thus guaranteed to discover all information for points with surviving degree
k or larger.

5.2 Communication Failures 

Now consider communication failures. We assume that each channel between
two cameras can fail with independent probability f < 1, after which it stays
down for the duration of the algorithm. Again, our timeout mechanism
allows the cameras to avoid being hung waiting for messages on failed lines.
The worst-case behavior of a failing communication line is to be down from
the outset. Since communication line failures are independent of the choices of
WBS computations made by the cameras, the probability that an edge that is
chosen by the cameras with probability p will also be up is (1 − f)p. Hence,
we can increase p by a factor of $\frac{1}{1-f}$ and make the random graph resulting
from the joint behavior of the cameras' choices and the adversary's failures have
the exact same structure as it had originally. This translates into the choice of a
neighbor with probability

$$\tau'(k) = 1 - \sqrt{1 - \frac{p(k)}{1-f}}$$

instead of $\tau(k)$. In the range of
20 < k < 50, overcoming 10% failures requires between 13% and 11% overhead,
and to overcome a huge 25% probability of failure it suffices to choose between
40% and 36% more cameras than in the fully reliable case.
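
The overhead figures can be explored with a few lines of Python. This is a hedged sketch: the function names are hypothetical, and since the formula for p(k) is not reproduced here, the edge probability p is passed in directly as an assumed input rather than derived from k.

```python
import math


def tau(p):
    """Per-node selection probability for a target edge probability p."""
    return 1.0 - math.sqrt(1.0 - p)


def tau_with_failures(p, f):
    """Selection probability needed when each channel is independently
    down with probability f, so that surviving edges still have probability p."""
    boosted = p / (1.0 - f)
    if boosted > 1.0:
        raise ValueError("p / (1 - f) exceeds 1; the target cannot be met")
    return tau(boosted)


def overhead(p, f):
    """Relative increase in the number of selected neighbours."""
    return tau_with_failures(p, f) / tau(p) - 1.0


if __name__ == "__main__":
    for p in (0.2, 0.4):  # assumed example values of p(k)
        print(p, round(overhead(p, 0.10), 3), round(overhead(p, 0.25), 3))
```

The overhead is always at least f/(1 − f) and grows slowly with p, which matches the slightly larger percentages quoted for smaller k.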

This discussion is summarized as follows. 

Theorem 2. (a) Executing the correspondence algorithm unchanged with parameter
k when camera crashes are possible will, with high probability, yield
the full correspondence information for at least 99% of the points that have
surviving exposure degree k or larger.

(b) When communication channels may fail with probability f < 1, executing
the algorithm with $\tau'(k) \approx \frac{1}{1-f}\tau(k)$ instead of $\tau(k)$ will, with high
probability, yield the full correspondence information for at least 99% of the points
that have exposure degree k or larger. Moreover, it will not require more
computation than the original algorithm does.



5.3 Failure of WBS to Detect Matches

We next consider failures of the WBS computation to identify the fact that a
feature point appears in two images being compared. Here we suppose that our
WBS algorithm will fail to identify a match with independent probability f < 1.
The situation here is very similar to the case of communication failures. Again,
the probability that an edge of Gp will be discovered by the first part of the
algorithm is (1 − f)p if edges of G are chosen with probability p. By the analysis we
performed in the case of communication failures, choosing $\tau'(k)$ instead of $\tau(k)$
images to compare with will provide us with the original guarantees. This time,
however, all $\tau'(k)$ computations and communications must be carried out. The
total overhead is then roughly $\frac{1}{1-f}$:

Theorem 3. When WBS computations may fail to identify a match with independent
probability f < 1, executing the algorithm with $\tau'(k) \approx \frac{1}{1-f}\tau(k)$ instead
of $\tau(k)$ will, with high probability, yield the full correspondence information for
at least 99% of the points that have exposure degree k or larger.

We remark that this analysis is applied to false-negative errors. Coping with 
false-positives — mistaken matches reported — can be done using distributed sys- 
tems’ techniques for handling malicious failures. This analysis is beyond the 
scope of this paper and is left as a topic for future work. 







S. Avidan, Y. Moses, and Y. Moses 



6 Experiments 

Our analysis ensures that the algorithm will indeed recover the correspondence. 
However, the analysis is very conservative, and in practice smaller numbers of 
WBS computations should suffice. We validated this expectation through exten- 
sive simulations in MATLAB. Our scenario is a surveillance system in an urban 
setting and thus our simulated test-bed consists of a collection of orthographic 
cameras that are mounted on roof-tops looking down. Each camera observes all 
the feature points within a pre-defined distance from its position. To ensure that 
all the cameras form a single connected component, we enforce overlap between 
the image footprint of the different cameras, on the ground. 

In every experiment we run the algorithm with precisely the same data, 
but with a different number of WBS operations. The experiments show that 
the predicted number of required WBS operations is indeed sufficient, but even 
smaller numbers can be used. 

To evaluate the success of each run, we define the average number of recovered
points for each exposure. A given point p is recovered if each camera in the p-set
knows the identity of all cameras in the p-set. In particular, let p be a point with
exposure degree k, and for every camera i denote by $L_i(p)$ the size of i's correspondence
list for p. Then observe that a suitably normalized sum of the $L_i(p)$ over the cameras
in the p-set equals 1 if p is fully recovered, and this value
is smaller than 1 if p is only partially recovered.



Fig. 3. The number of recovered points for each exposure degree, E(k). Red is the
number of points with given exposure. Blue is the average portion of correspondence
found after the matching phase, and green is the portion after propagation. The
experiment was run on a reliable system with 50 cameras and 500 points. (a) The results
when using 3 WBS per camera, (b) when using 6 WBS, and (c) when using 12.



Let E(k) be the set of 3D points with an exposure degree k. Ideally, if all
the points in E(k) are recovered, then the number of points with exposure k is
given by the sum of the per-point recovery value over all p ∈ E(k).
We use the measure E(k) to evaluate the success of extracting the correspondence:
when all the connections are recovered, the measure equals |E(k)|.

For each exposure degree k, we present three values of E(k). The first, the red
bar, is the real number of points for each exposure degree. The second, the green
bar, is the final result of our algorithm: the number of computed recovered
points at the end of the run. If all points for a given exposure were uncovered,
then the green bar will cover the red bar. Finally, the blue bar is the value after
the WBS operations alone, that is, when only direct edges in the graph are considered.

The first experiment was designed to verify the bound on the number of
required WBS operations. We generated a setup of 50 cameras and 500 points
(Figure 4a) and simulated the behavior of the algorithm five times, changing
the number of WBS operations each time. As can be seen in Figure 3a, using just three
WBS operations does not generate enough matching points and hence the algorithm does
not fully recover any of the p-sets. As the number of WBS operations performed grows,
the number of fully recovered p-sets grows. In Figures 3b and 3c, we present the
results of running 6 and 12 WBS operations. As can be seen, the exposure degree from
which full correspondence is obtained is reduced as the number of WBS operations
a camera performs increases. In Figure 4b, we present the smallest degree for which
all the p-sets of that degree were fully recovered.




Fig. 4. (a) The setup with 50 cameras and 500 points. Each point is marked in blue,
and each camera center is marked in red. The field of view of one of the cameras is
marked in pink. (b) Full recovery with 20% errors of the WBS in red, and with no
errors in black.



In the second experiment we evaluated our algorithm when the WBS algorithm
failed to find 20% of the matches. The same set of cameras and points as
in the first experiment was used, in order to compare the performance using
perfect and imperfect WBS. The results are presented in Figure 4.

7 Summary

Visual systems consisting of a large number of geographically distributed cam- 
eras are, in particular, distributed computing systems. Information is generated 
and gathered at different sites, and a communication medium is used for inte- 
grating the data being gathered. We have shown how a particular application, 
namely image correspondence across multiple cameras, can be done in a dis- 
tributed manner. 

This approach allows us to use results from distributed systems theory to analyze
the complexity of the distributed algorithm. In particular, we have determined
the number of pair-wise stereo matching computations required to detect,
with high probability, all points that appear in a given number of cameras.
Moreover, the analysis carries over naturally to the centralized case as well. Our
distributed approach naturally handles failures in the communication lines,
the processing units, and the stereo matching algorithm in a single, coherent framework.
So, we have started with a distributed approach which, we believe, should
be the natural way to approach large scale camera settings, and ended with
contributions to centralized algorithms. We plan to apply this distributed analysis
to other problems in computer vision.

Abstract. A novel algorithm is presented for the 3D reconstruction of human action
in long (> 30 second) monocular image sequences. A sequence is represented
by a small set of automatically found representative keyframes. The skeletal joint
positions are manually located in each keyframe and mapped to all other frames
in the sequence. For each keyframe a 3D key pose is created, and interpolation
between these 3D body poses, together with the incorporation of limb length and
symmetry constraints, provides a smooth initial approximation of the 3D motion.
This is then fitted to the image data to generate a realistic 3D reconstruction. The
degree of manual input required is controlled by the diversity of the sequence's
content. Sports footage is ideally suited to this approach as it frequently contains
a limited number of repeated actions. Our method is demonstrated on a long
(36 second) sequence of a woman playing tennis filmed with a non-stationary
camera. This sequence required manual initialisation on < 1.5% of the frames,
and demonstrates that the system can deal with very rapid motion, severe
self-occlusions, motion blur and clutter occurring over several concurrent frames. The
monocular 3D reconstruction is verified by synthesising a view from the perspective
of a 'ground truth' reference camera, and the result is seen to provide a
qualitatively accurate 3D reconstruction of the motion.



1 Introduction 

This paper addresses the challenge of generating a qualitatively accurate 3D reconstruction
of the actions performed by an individual in a long (~30 second) monocular image
sequence. It is assumed the individual is not wearing any special reflective markers or
clothing. Any solution must be able to cope with the multitude of difficulties that may
arise over several concurrent frames: severe self-occlusion, unreliability of methods for
limb and joint detection, motion blur, and the inherent ambiguities in reconstructing rigid
links from monocular images [15]. Until now, the only approach guaranteed to produce
a complete and accurate reconstruction in such circumstances is: for each frame in the
sequence, manually locate the skeletal joints and perform 3D reconstruction using the
method of [15]. The latter involves solving the forward/backward binary ambiguity for
each rigid link by inspection and estimating the relative lengths of each limb. For very
short sequences this is a relatively painless procedure, but it rapidly becomes impractical
for longer sequences.

The traditional tracking approach to human motion capture [7] is to perform manual
initialisation at the beginning of the sequence and then update the estimate of the
reconstruction over time in accordance with the incoming data. In contrast we consider the
entire sequence and approximate the actions present by a set of representative frames 
(automatically determined from the sequence) and from these obtain a coarse description 
of the subject’s motion. Finer detail is added by locating the skeletal joints in each frame 
by extrapolating from manually initialised joint locations on the representative frames. 

The degree of manual input required is controlled by the diversity of the sequence's
content. Sports footage is ideally suited to this approach as it frequently contains a
limited number of repeated actions. Throughout this paper the ideas and methods developed
are illustrated and tested on a 36 second sequence of a woman playing tennis. Our results 
are verified by synthesising a view of the 3D reconstruction from the perspective of a 
reference camera not used for the reconstruction. 

The motivation for pursuing this problem together with a review of related research is 
presented in section 2. An overview of the algorithm is given in section 3. Section 4 details
the grouping performed to obtain a keyframe representation of a sequence. Building upon 
this representation, the skeletal joint locations in each frame are estimated (section 5). The 
procedure for constructing the 3D reconstruction of the sequence is given in section 6, 
and the final reconstructions achieved for the tennis sequence are displayed in section 7 
prior to the concluding remarks. 

2 Background 

Markerless human motion capture has drawn growing interest in recent years. The ma- 
jority of systems developed have used multiple cameras to capture the subject [2,3,7]. 
However, stereo systems are rare outside of research laboratories and studios, and the 
bulk of videos of human activity are monocular. This, together with the comparative 
ease of capturing monocular sequences, motivates the monocular problem as one of 
more than purely academic interest. 

Several researchers have tackled the challenge of human motion capture from monoc- 
ular sequences, and some impressive results have been achieved over short sequences [13, 
10]. Sminchisescu and Triggs [12,13] have achieved the most successful results to date 
in monocular markerless 3D human motion capture. Their algorithms are based upon 
propagating a mixture of Gaussians pdf, representing the probable 3D configurations of 
a body over time. Success relies upon performing efficient and thorough global searches 
of the cost surface associating the image data to potential body configurations. These 
methods have proved effective on relatively short sequences. However, it is an open
question whether the propagation of a multi-modal distribution, without an explicit
mechanism for re-initialisation, is sufficient for long sequences.

Potential disruptions to smooth tracking conditions can be bridged by imposing 
priors on the dynamics of the configuration of the body. These have been used to some 
effect [11,10,1]. However, this comes at a cost. The motions present in a novel sequence
may not be adequately described by the priors in use, and the appropriate trade-off
between fitting the image data and fulfilling the prior constraints has to be decided. Also,
for long sequences of diverse motion (e.g. tennis) no one dynamical model can fully
explain the motions present, necessitating the introduction of some form of recognition.

The general problem with tracking long sequences is that it is difficult to encapsulate
the diversity of motion in a prior model. However, it is possible to summarise the motion
in such a sequence. Several researchers have summarised the content of video by detecting
and describing the actions (or subjects) present [17,8] by clustering together
frames or sequences of frames with similar properties. Toyama and Blake [16] showed
that actions in a sequence could be summarised by a set of keyframes (exemplars)
extracted from the sequence, and proceeded to describe a novel video clip as a sequence of
warped versions of these keyframes. A similar approach has been taken in more recent
work [14,6], where sophisticated methods are used to match hand-defined keyframes to 
individual frames. Furthermore, by identifying specific joint locations on each keyframe, 
it was possible to localise these joint positions throughout a sequence. These methods, 
though only applied to short sequences, show an approach to tracking driven by pose 
recognition. This circumvents the problem of initialisation and is resistant to complete 
failure due to tracking loss, thereby opening the way to track long sequences. 

This paper extends the keyframe-based approach of Sullivan and Carlsson [14] to 
long sequences with no prior learning and no pre-defined keyframes. A subsequent 3D 
reconstruction is performed using the method of [15]. 

3 Overview of Algorithm 

Figure 1 gives an overview of the algorithm developed in this paper, from the initial 
extraction of the keyframes summarising the sequence, through the labelling of skeletal 
joint positions, formation of 3D keyframes, and interpolation of 3D keyframes, to the 
final 3D reconstruction. 

Automatically representing the sequence by a set of keyframes requires measuring the 
similarity between the poses present in every pair of frames in the sequence. A distance 
matrix summarises these similarities and is used as the basis for finding the representative
poses, which in turn are encapsulated in keyframes summarising the sequence. The second
layer in figure 1 encompasses the initialisation of each keyframe: the 2D skeletal joints
are manually labelled and their corresponding 3D reconstructions created [15]. The
2D skeletal joints are then automatically determined throughout the sequence using
the 2D keyframes and the keyframe assignment for each frame [14,6]. This involves
approximating the warp of the assigned keyframe to each frame and transferring the
defined skeletal joints accordingly.

Fig. 1. Overview of the algorithm.

Next follows an initial estimation of a smooth 3D reconstruction of the sequence, 
whereby each frame deemed sufficiently close to a 3D keyframe is replaced by that 
keyframe. Interpolation occurs between these frames to estimate the intermediate frames. 
Finally the interpolated 3D reconstruction is refined to fit the estimated 2D joint locations 
throughout the sequence. This is achieved by minimising the reprojection error, while 
taking into account motion smoothness and imposing limb length and symmetry con- 
straints. This ensures that any errors in the 2D data do not result in invalid reconstructions 
of the skeleton. 

4 Defining Keyframes 

We are interested in extracting, from a sequence $\mathcal{I} = \{1, \ldots, N\}$, a set of keyframes
$\mathcal{K} \subset \mathcal{I}$ which span the body poses in $\mathcal{I}$. Besides providing a summary of the content
of the sequence, each keyframe will assist in the skeletal joint localisation in frames of
similar appearance. Such frames are considered well-represented by a keyframe. Thus
$\mathcal{K}$ has an associated set $\mathcal{W}_{\mathcal{K}} \subset \mathcal{I}$ of frames it well-represents. The poses between two
well-represented frames less than T frames apart may be approximated by interpolating
between the well-represented frames. These interpolatable frames define a set $\mathcal{J}_{\mathcal{K}} \subset \mathcal{I}$.

We wish to choose the smallest number of keyframes that enable an accurate description
of the pose in a fraction $\alpha$ of the sequence's frames. That is, we aim to find the $\mathcal{K}$ with
minimal cardinality such that

$$|\mathcal{W}_{\mathcal{K}} \cup \mathcal{J}_{\mathcal{K}}| \ge \alpha N \qquad (1)$$

Keyframe selection is based upon a distance matrix $D \in \mathbb{R}^{N \times N}$ describing the
similarity in body pose between every pair of frames in the sequence. Below we explain
how D is computed and then analysed to produce $\mathcal{K}$.

4.1 Measuring Pose Similarity between Frames 

The subject is localised by finding the head and feet positions in each frame. This is done 
by sequentially applying colour histograms, low-pass filtering, and a radial symmetry 
operator [5] to detect round and elliptical regions of the appropriate scale and colour 
to correspond to either a head or foot. A plausible series of head and feet positions 
is isolated by finding the most temporally consistent path of the candidate locations
through the sequence [4]. Based on the computed head and feet locations, a bounding 
box is estimated for the subject. Figure 2 illustrates this process. 

Target regions of homogeneous colour are then extracted, and represented by directed 
edge elements. The edges of each region are sampled at regular intervals. Each sample
point is represented by a point vector tangent to the edge and oriented so the interior of 
the target region is to its left, see figure 2(e). 

Fig. 2. (a) Original image, (b) head- and (c) feet-like colours highlighted, low-pass filtered and
with peaks in radial symmetry indicated (the magnitude of each peak is shown by the size of
the cross), (d) identified head and feet regions and resulting bounding box, (e) directed edge
elements of target regions.

Pairs of images can now be compared by computing a correspondence field between
the edge points. The frames are aligned using the tracked head and feet locations, and
each edge element is matched to the closest edge element in the other image that comes
from the same coloured target region and whose orientation differs by less than 45 degrees. A
comparison of the body poses can then be computed by considering the average distance
between corresponding points, together with the percentage of edge elements for which
a corresponding match was found.
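
The matching rule can be sketched in Python as follows. This is a minimal illustration under assumptions, not the authors' code: each edge element is represented as a hypothetical tuple `(x, y, direction_deg, region_label)`, the images are taken to be already aligned, and a brute-force nearest-neighbour search is used.

```python
import math


def angle_diff(a, b):
    """Smallest absolute difference between two directions, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def compare_poses(edges_a, edges_b, max_angle=45.0):
    """Match every edge element of image A to the closest element of image B
    with the same region label and a direction within max_angle degrees.

    Returns (mean distance over matched elements, fraction matched).
    """
    distances = []
    for xa, ya, ta, la in edges_a:
        candidates = [
            math.hypot(xa - xb, ya - yb)
            for xb, yb, tb, lb in edges_b
            if lb == la and angle_diff(ta, tb) < max_angle
        ]
        if candidates:
            distances.append(min(candidates))
    matched = len(distances) / len(edges_a) if edges_a else 0.0
    mean_dist = sum(distances) / len(distances) if distances else float("inf")
    return mean_dist, matched


if __name__ == "__main__":
    a = [(0, 0, 0, "shirt"), (10, 0, 90, "shirt")]
    b = [(3, 4, 10, "shirt"), (10, 1, 270, "shirt")]
    print(compare_poses(a, b))
```

In the toy input the second element of `a` has no partner with a compatible direction, so only half of the elements are matched.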



4.2 Distance Matrix 

Using the method described in section 4.1, we can determine the distance and the
percentage of successfully matched points between every i-th and j-th frame in the sequence.
Putting the respective output into the matrices $B, A \in \mathbb{R}^{N \times N}$, an initial distance matrix
C is then computed by combining these as

$$C(i,j) = A(i,j)\,B(i,j) + (1 - A(i,j)) \max B$$

The resulting matrix C gives a good indication of the dissimilar and similar frames. 
However, it can be improved. When the inter-frame distance is sufficiently small,
$C(i,j) < \beta$, frames i and j are extremely likely to contain the same pose. In this
case the corresponding i-th and j-th rows (and columns) of C should be almost identical,
and any observed differences can be treated as noise. The final distance matrix D is
formed by replacing each row and column of C with the average of all the rows and
columns corresponding to frames to which it has a distance less than $\beta$. This reduces the
noise giving a cleaner distance matrix. 
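
The two steps of this section can be sketched in plain Python. The sketch rests on assumptions that are not stated in the text: A holds match fractions in [0, 1], and the row and column averaging is interpreted as averaging first over rows and then over columns of each frame's group of near-identical frames. The function names are hypothetical.

```python
def combine(A, B):
    """C(i,j) = A(i,j) B(i,j) + (1 - A(i,j)) max B, with A in [0, 1]."""
    n = len(B)
    max_b = max(max(row) for row in B)
    return [
        [A[i][j] * B[i][j] + (1.0 - A[i][j]) * max_b for j in range(n)]
        for i in range(n)
    ]


def smooth(C, beta):
    """Average the rows, then the columns, of C over the frames whose
    distance to the current frame is below beta (one reading of the text)."""
    n = len(C)
    # Each frame always belongs to its own group.
    groups = [sorted({i} | {j for j in range(n) if C[i][j] < beta}) for i in range(n)]
    rows = [
        [sum(C[g][j] for g in groups[i]) / len(groups[i]) for j in range(n)]
        for i in range(n)
    ]
    return [
        [sum(rows[i][g] for g in groups[j]) / len(groups[j]) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    B = [[0, 1, 8], [1, 0, 9], [8, 9, 0]]
    A = [[1, 1, 0.5], [1, 1, 0.5], [0.5, 0.5, 1]]
    C = combine(A, B)
    print(C)
    print(smooth(C, beta=2))
```

In the toy example frames 0 and 1 are within beta of each other, so after smoothing their rows coincide.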

Figure 3 shows D for the upper body for an 1800 frame tennis sequence, with 
several example frames and their corresponding rows and columns in the matrix. The 
dark rectangular regions in the matrix correspond to periods where there is little change 
between frames. For the tennis sequence this equates to the player standing still in 
between strokes, such as in frames 448 and 1662. Dark diagonals (off the main diagonal) 
correspond to distinct repeated events, such as the forehand (614, 1258) and backhand 
(174, 1032) frames. Note that there is only one such dark diagonal in the rows and 
columns corresponding to frames 174 and 1032. This is because there are only two 
backhands in the sequence, and thus only one repeated event.

Fig. 3. The distance matrix D for the upper body pose over a 1800 frame (36 second) tennis
sequence, with several sample frames. Short dark diagonals correspond to forehands and backhands,
and dark rectangular regions indicate periods where the player is standing still.

4.3 Keyframe Selection 

We define a criterion for considering one frame to be well-represented by another. Recall
from section 4.2 that if $D(i, j) < \beta$, then frames i and j are considered to exhibit the same
pose. We say that such frames are well-represented by each other.

We now describe an algorithm to find a $\mathcal{K}$ with minimal $|\mathcal{K}|$ which fulfills equation
(1). Keyframes are iteratively selected to minimize the average distance of all frames
from their neighbouring well-represented frames. Firstly, define $C_{\mathcal{K}}$ as:

$$C_{\mathcal{K}} = \sum_{i=1}^{N} \left( \min_{f \in \mathcal{W}_{\mathcal{K}},\, f \le i} |i - f| + \min_{f \in \mathcal{W}_{\mathcal{K}},\, f \ge i} |i - f| \right) \qquad (2)$$

Then set $\mathcal{K}^{(0)} = \emptyset$. Keyframes are repeatedly selected according to:

$$\mathcal{K}^{(t+1)} = \mathcal{K}^{(t)} \cup \{j\} \qquad (3)$$

where

$$j = \arg\min_{1 \le k \le N,\; k \notin \mathcal{K}^{(t)}} C_{\mathcal{K}^{(t)} \cup \{k\}} \qquad (4)$$

until the criterion in equation (1) is satisfied.

This algorithm was applied to extract keyframes from an 1800 frame sequence of
a woman playing tennis with T set to 10 and $\alpha = 0.95$. The upper and lower body
were divided, and separate distance matrices and keyframes determined for each. 25
keyframes were required for the upper body and 22 for the lower body in order to satisfy
equation (1) (see figure 4).
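
The greedy selection can be sketched in Python. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, frames are indexed from 0, a frame with no well-represented frame on one side is charged a penalty of N (an assumption, since the text does not say how this case is handled), and the diagonal of D is assumed to be below the threshold so that every keyframe well-represents itself.

```python
def well_represented(D, keys, beta):
    """Frames within distance beta of at least one keyframe."""
    return sorted(i for i in range(len(D)) if any(D[i][k] < beta for k in keys))


def covered(D, keys, beta, T):
    """Well-represented frames plus frames lying between two consecutive
    well-represented frames that are less than T frames apart."""
    w = well_represented(D, keys, beta)
    cov = set(w)
    for a, b in zip(w, w[1:]):
        if b - a < T:
            cov.update(range(a + 1, b))
    return cov


def cost(D, keys, beta):
    """Sum over frames of the distances to the nearest well-represented
    frame on either side (penalty N when a side has none: an assumption)."""
    n = len(D)
    w = set(well_represented(D, keys, beta))
    total = 0
    for i in range(n):
        before = next((i - f for f in range(i, -1, -1) if f in w), n)
        after = next((f - i for f in range(i, n) if f in w), n)
        total += before + after
    return total


def select_keyframes(D, beta, T, alpha):
    """Add the cost-minimising frame until a fraction alpha is covered."""
    n = len(D)
    keys = []
    while len(covered(D, keys, beta, T)) < alpha * n:
        best = min(
            (k for k in range(n) if k not in keys),
            key=lambda k: cost(D, keys + [k], beta),
        )
        keys.append(best)
    return keys


if __name__ == "__main__":
    poses = [0, 0, 0, 5, 5, 5, 0, 0, 0, 5]  # toy one-dimensional "poses"
    D = [[abs(a - b) for b in poses] for a in poses]
    print(select_keyframes(D, beta=1, T=3, alpha=1.0))
```

On the toy sequence two keyframes, one per distinct pose, are enough to cover every frame.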



4.4 A Keyframe Representation of the Sequence

Figure 5 shows an example upper body and lower body keyframe and the associated 
well-represented frames from the sequence. 

By representing each frame by its closest keyframe we can examine the occurrence 
of different body poses throughout the sequence. Figure 4 shows all the keyframes ex- 
tracted from the sequence and figure 6 shows which keyframe best represents each frame 
throughout the sequence. This graph characterises the pose variation in the sequence and 
the forehands and backhands are easily identified respectively by the strong peaks and 
troughs in the graph in figure 6. 







Fig. 5. Example upper and lower body keyframes, and the frames well-represented by these
keyframes.

Fig. 6. Occurrence of frames associated with the various keyframes throughout the sequence. This
graph is for the upper body.



5 Locating Joint Positions 

For each keyframe, k G 1C, its n skeletal joints, • • • , Xn,k) G 

are manually annotated. Points from the appropriate keyframe are then automatically 
mapped to every frame in the sequence to obtain an estimate of xi-^ = (xi, • • • , x^v). 
Figure 7 shows an annotated keyframe k, and joint locations estimated for a frame t, 
assigned to this keyframe. The aligned keyframe edges have been superimposed onto 
Figure 7(b). Each joint in the keyframe has associated edge points in its vicinity and 
the correspondences found between these edge points and the edge points in the frame 
t define a translation. This translation is used to transfer the joint from the keyframe to 
frame t. Once an estimate of each joint in frame t is obtained, it is refined using the
appearance of the joints in the keyframe, and enforcing the apparent limb length ratios
evident in the keyframe [14]. Figure 7(c) shows the final estimates. 
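
The joint transfer step can be sketched as follows. This is only an illustration under stated assumptions: the matched edge points are taken as given, the translation is estimated as the mean displacement of the matches lying within a fixed radius of the joint, and the names `transfer_joint` and `radius` are hypothetical. The subsequent appearance-based refinement is not shown.

```python
import numpy as np

def transfer_joint(joint, key_pts, frame_pts, radius=20.0):
    """Move a keyframe joint into a frame using nearby edge correspondences.

    key_pts[i] and frame_pts[i] are matched 2D edge points (assumed given).
    The translation is the mean displacement of the matches whose keyframe
    point lies within `radius` pixels of the joint (an assumption).
    """
    joint = np.asarray(joint, dtype=float)
    key_pts = np.asarray(key_pts, dtype=float)
    frame_pts = np.asarray(frame_pts, dtype=float)
    near = np.linalg.norm(key_pts - joint, axis=1) <= radius
    if not near.any():
        return joint  # no supporting edge points: leave the joint unchanged
    translation = (frame_pts[near] - key_pts[near]).mean(axis=0)
    return joint + translation

key_pts = np.array([[10.0, 10.0], [12.0, 11.0], [200.0, 200.0]])
frame_pts = key_pts + np.array([3.0, -2.0])
frame_pts[2] += 50.0  # a distant match that should not influence this joint
estimate = transfer_joint([11.0, 10.0], key_pts, frame_pts)
```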




Fig. 7. (a) annotated keyframe k, (b) point correspondences between keyframe and well- 
represented frame, and (c) joint locations estimated for the well-represented frame t. 



6 3D Reconstruction 

The human skeleton can be modelled as an articulated chain with $n_l$ links. Given the
projection of the skeletal joint locations $x_t$ onto the image plane, the number of qualitatively
different reconstructions, $X_t$, is bounded by $2^{n_l}$ [15] (assuming orthographic
imaging), as each link can point either toward or away from the image plane. For an
N-frame sequence, the number of possible reconstructions explodes to $2^{n_l N}$. This enormous
search-space can be pruned by imposing the physiological limitations of the human
body [13] and bounding the motion between adjacent frames. Without prior information,
estimating the skeleton’s configuration over the sequence requires deciding the
optimal binary labelling at each frame based on heuristic continuity measures.
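
The origin of the $2^{n_l}$ bound can be made concrete with a small sketch. It assumes unit-scale orthographic projection and known 3D link lengths, so that the relative depth of each link is fixed up to sign by $\Delta z = \pm\sqrt{l^2 - d^2}$, where $l$ is the link length and $d$ its projected length. The function name `depth_candidates` and the toy chain are hypothetical.

```python
import itertools
import math

def depth_candidates(joints_2d, lengths):
    """Enumerate the 2**n_l depth assignments of a projected chain.

    joints_2d: list of (x, y) image points along the chain.
    lengths:   one 3D length per link (assumed known, unit image scale).
    Returns a list of depth tuples, with the root depth fixed to 0.
    """
    dz = []
    for (x0, y0), (x1, y1), length in zip(joints_2d, joints_2d[1:], lengths):
        d2 = (x1 - x0) ** 2 + (y1 - y0) ** 2
        dz.append(math.sqrt(max(length ** 2 - d2, 0.0)))  # clamp against noise
    candidates = []
    for signs in itertools.product((+1, -1), repeat=len(dz)):
        z = [0.0]
        for sign, step in zip(signs, dz):
            z.append(z[-1] + sign * step)  # link points toward or away
        candidates.append(tuple(z))
    return candidates

chain = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
candidates = depth_candidates(chain, lengths=[5.0, 5.0])
```

For this two-link chain there are four candidates; choosing among them in every frame is the binary labelling problem described above.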

Therefore, the crucial issue is the generation of prior information about the 3D
configuration of the subject in the video. From the previous section we have a set of
keyframes, $\mathcal{K}$, which span the 2D poses in the sequence. The 3D reconstruction of these
keyframes provides an approximate basis for the 3D poses exhibited in the sequence.
Thus with a limited amount of manual effort we have obtained some crucial priors.
The next section describes how these 3D keyframes are used to create a smooth initial
estimate, $X^{0}_{1:N}$, of the 3D configuration of the subject throughout the sequence.

6.1 Establishing a Smooth Representative Reconstruction 

The elements of $\mathcal{W}_{\mathcal{K}}$ and their corresponding keyframe assignments define the frames in
the sequence that are well approximated by the 3D keyframes. Replacing each of these
frames with its appropriate keyframe, and using these as control points in a spherical
linear interpolation (slerp) process [9] allows the approximation of intermediary frames
not in $\mathcal{W}_{\mathcal{K}}$. Keyframes have been chosen to ensure that the temporal distance interpolated
is never large (equation (1)). However, temporally adjacent frames in $\mathcal{W}_{\mathcal{K}}$ are frequently
assigned to the same keyframe. In reality they do not correspond to exactly the same 3D
pose. One of the frames’ 3D poses will, in general, match the keyframe more accurately
than the others, and the other frames are better approximated by interpolation between
the keyframes that temporally bound them.

To this end, temporal runs of frames in $\mathcal{W}_{\mathcal{K}}$ that are well-represented by the same
keyframe are identified. The fit of each frame in the run to the 3D keyframe is ranked
(ranking is based on a robust measure of the Euclidean distance between the reprojected
3D keyframe and the frame’s estimated 2D joints). The lowest ranked frames in each
run are iteratively omitted from the set of control points, subject to the criterion that T
must be the maximum distance between control points.

Once the final control points have been decided, the interpolation is performed to
obtain $X^{0}_{1:N}$. Figure 8 summarises the interpolation process.
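
A minimal sketch of spherical linear interpolation between two control points is given below. It assumes that each joint rotation is stored as a unit quaternion in (w, x, y, z) order, which is an assumption made for illustration; the function name `slerp` and the example rotations are likewise hypothetical.

```python
import numpy as np

def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions, u in [0, 1]."""
    q0 = np.asarray(q0, dtype=float)
    q1 = np.asarray(q1, dtype=float)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:            # q and -q are the same rotation: take the short arc
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: fall back to normalised lerp
        q = q0 + u * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)   # angle between the two quaternions
    return (np.sin((1.0 - u) * theta) * q0 + np.sin(u * theta) * q1) / np.sin(theta)

# Control points: identity and a 90-degree rotation about the z axis.
q_a = np.array([1.0, 0.0, 0.0, 0.0])
q_b = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
q_mid = slerp(q_a, q_b, 0.5)  # halfway: a 45-degree rotation about z
```

Evaluating `slerp` at evenly spaced values of `u` between consecutive control points yields the intermediate frames.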

6.2 Fitting the Smooth Motion Estimate to the Joint Data 

The last task is to refine the 3D reconstruction by allowing the localised joint locations
$x_{1:N}$ to influence $X^{0}_{1:N}$. However, the localised joint locations may contain outliers, be
corrupted by noise and suffer from missing estimates due to self-occlusion. To ensure
robustness to these factors, the final estimate of $X_{1:N}$ is forced to be a valid trajectory
of a human skeleton.

Define $\mathcal{M}_N$ as the manifold describing all valid trajectories of length N of the
skeleton. Then:

$$X_{1:N} = \arg\min E(X_{1:N}) \quad \text{subject to} \quad X_{1:N} \in \mathcal{M}_N \qquad (5)$$

where E is a cost function based on the sum of squared differences between $x_{1:N}$ and
the orthographic projection of $X_{1:N}$ (denoted by $P(X_{1:N})$):

$$E(X_{1:N}) = \| P(X_{1:N}) - x_{1:N} \|^2 \qquad (6)$$

Fig. 8. Visualisation of the generation of a smooth and plausible trajectory of the 3D skeleton that
approximates the content of the video.



There is no easy characterisation of $\mathcal{M}_N$, so enforcing $X_{1:N}$ to belong to $\mathcal{M}_N$ is
difficult. However, all members of $\mathcal{M}_N$ must exhibit constant limb-length throughout
the sequence, and each joint trajectory must follow a smooth path. By forcing $X_{1:N}$ to
satisfy these constraints, $X_{1:N}$ will be on or close to $\mathcal{M}_N$.



Step 1: Translate, rotate and scale $X^{0}_{1:N}$ to fit the 2D data.
Step 2: Set i = 1.
Step 3: Gradient descent:
        $X^{i+1}_{1:N} = X^{i}_{1:N} - \lambda \nabla E(X^{i}_{1:N})$, $0 < \lambda < 1$.
Step 4: Enforce constraints: $X^{i+1}_{1:N} \in \mathcal{M}_N$.
Step 5: Increment i by one and go to Step 3 (until convergence).

Fig. 9. The iteration steps involved in finding $X_{1:N}$.

By construction $X^{0}_{1:N} \in \mathcal{M}_N$. Therefore, it is used as the initial guess for the
solution of the minimisation problem posed in equation (5). Figure 9 gives an outline of
how the minimisation proceeds. At the end of each iteration, enforcing $X^{i+1}_{1:N} \in \mathcal{M}_N$ is
approximated by resetting the limb-lengths to their correct value, and applying a low-pass
filter to the trajectories of each joint. A large $\lambda$ yields faster convergence, but makes
it more difficult to re-project the solution back onto $\mathcal{M}_N$. We used $\lambda = 0.2$ for our
experiments.
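
The iteration can be sketched in a few lines. This is an illustration under several assumptions rather than our implementation: projection simply drops the depth coordinate, the low-pass filter is a moving average of odd width, the filter is applied before the limb lengths are reset so that the lengths hold exactly on exit, and the initial alignment of Step 1 is omitted. All function names and the toy data are hypothetical.

```python
import numpy as np

def enforce_limb_lengths(X, parents, lengths):
    """Rescale every bone to its target length, walking out from the root.
    X: (N, n, 3) joints; parents[j] is the parent of joint j (-1 for the root);
    joints are assumed ordered so that parents precede children."""
    Y = X.copy()
    for j, p in enumerate(parents):
        if p < 0:
            continue
        bone = X[:, j] - X[:, p]
        norm = np.linalg.norm(bone, axis=1, keepdims=True)
        Y[:, j] = Y[:, p] + lengths[j] * bone / np.maximum(norm, 1e-12)
    return Y

def low_pass(X, width=5):
    """Moving average along time (odd width), with edge padding."""
    kernel = np.ones(width) / width
    pad = width // 2
    Xp = np.pad(X, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    out = np.empty_like(X)
    for j in range(X.shape[1]):
        for c in range(3):
            out[:, j, c] = np.convolve(Xp[:, j, c], kernel, mode="valid")
    return out

def reprojection_error(X, x2d):
    """E(X) = ||P(X) - x||^2 with P dropping the depth coordinate."""
    return float(np.sum((X[..., :2] - x2d) ** 2))

def refine(X0, x2d, parents, lengths, lam=0.2, iters=50):
    X = X0.copy()
    for _ in range(iters):
        grad = np.zeros_like(X)
        grad[..., :2] = 2.0 * (X[..., :2] - x2d)        # gradient of E
        X = X - lam * grad                              # gradient descent step
        X = low_pass(X)                                 # smooth trajectories
        X = enforce_limb_lengths(X, parents, lengths)   # constant limb lengths
    return X

# Toy data: a static 3-joint chain over 20 frames.
parents = [-1, 0, 1]
lengths = [0.0, 1.0, 1.0]
pose0 = np.array([[0.0, 0.0, 0.0], [0.6, 0.0, 0.8], [0.6, 1.0, 0.8]])
X0 = np.repeat(pose0[None], 20, axis=0)
x2d = np.repeat(np.array([[0.0, 0.0], [0.8, 0.0], [0.8, 1.0]])[None], 20, axis=0)
X = refine(X0, x2d, parents, lengths)
```

In the toy example the first link rotates in depth until its projection matches the 2D data, while both link lengths stay fixed.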

Figure 10 shows how a 3D keyframe is refined in 3D to match the image data. Here
the same 3D keyframe is modified to form two different 3D reconstructions to match 
two different forehand frames, capturing the subtle differences between the two forehand 
strokes. 
[Panel titles: Key pose; Joint data; Reconstruction; Joint data; Reconstruction]

Fig. 10. Result of refining a key-pose based on image-data. The key-pose is refined according to 
the different images, resulting in two different 3D poses. 



7 Results 

Our algorithm was applied to reconstructing a 36 second tennis sequence filmed with 
a non-stationary camera. During the sequence the player moves about the baseline and 
plays several forehand and backhand strokes. Our results were verified by synthesising 
a view of the 3D reconstruction from the perspective of a reference camera. Figure 11
shows the experimental setup together with 3D reconstructions throughout the sequence
and associated ‘ground-truth’ frames from the reference camera. Figure 12 shows a
reconstructed forehand, together with the reference video, and demonstrates the realistic 
smoothness of the reconstructed 3D motion. The 3D reconstruction of the complete 36 
second video is presented in the demonstration video together with the 2D tracking 
under-pinning the reconstruction. 

The video and figures 11 and 12 show the qualitative accuracy of our results, and 
demonstrate that our system can deal with a diverse range of actions recurring over 
a long sequence. Figure 13 further demonstrates how our system is able to deal with 
self-occlusion, rapid motion, clutter from the tennis racket, and motion blur. 

The system detects outliers as discontinuities in the 3D motion and fills in the missing 
data via interpolation to form a plausible trajectory. This enables the system to deal with 
isolated tracking failures. Further, the underlying recognition-based approach to the 2D 
tracking means the target is freshly detected each frame, and thus ideally placed to 
recover from ‘tracking loss’. In the worst case, the 3D reconstruction will revert to the
smooth interpolation from the keyframes (figure 14). How accurate these key poses are
depends on how well the sequence is represented by the keyframes. This is specified by
the user, who defines α, the percentage of the sequence which is well-represented by the
keyframes.

Our method is well-suited to action sequences with repeated events (e.g. sport). 
Furthermore, it is possible to quantify the suitability of a sequence for this form of 
reconstruction by checking how many keyframes are required to represent the desired 
percentage of the sequence. 



8 Closing Remarks 

We have presented a method for the 3D reconstruction of articulated body motion from 
a long monocular sequence. The performance of our system was demonstrated over 36
seconds of tennis footage and shown to provide a qualitatively accurate reconstruction.
To our knowledge this is the longest full-body 3D reconstruction attempted from markerless
monocular image data.



Camera positions used for the experiment.

Fig. 11. Results of the reconstruction of the entire sequence. Every 50th frame of the 36s long
sequence is shown together with the image from our reference camera.
Fig. 12. The reconstruction of one forehand stroke shown together with the images from our
reference camera. Note the smoothness of the reconstruction.





[Panel titles: Clutter; Rapid Motion]



Fig. 13. Examples of reconstructions achieved under difficult imaging conditions. Each case shows 
the tracked 2D data, the 3D reconstruction from the perspective of the reference camera, and the 
view from the reference camera. 




Fig. 14. The importance of maintaining a smooth motion. A large error is encountered in the joint 
localisation (a). Without enforcing motion smoothness, the frame would be reconstructed as (b).
In (c) the reconstructed frame is shown after enforcing smoothness constraints. 
Abstract. A number of studies have demonstrated that infrared (IR)
imagery offers a promising alternative to visible imagery due to its
insensitivity to variations in face appearance caused by illumination changes.
IR, however, has other limitations including that it is opaque to glass.
The emphasis in this study is on examining the sensitivity of IR imagery
to facial occlusion caused by eyeglasses. Our experiments indicate that
IR-based recognition performance degrades seriously when eyeglasses are
present in the probe image but not in the gallery image and vice versa. To
address this serious limitation of IR, we propose fusing the two modalities,
exploiting the fact that visible-based recognition is less sensitive to
the presence or absence of eyeglasses. Our fusion scheme is pixel-based,
operates in the wavelet domain, and employs genetic algorithms (GAs) to
decide how to combine IR with visible information. Although our fusion
approach was not able to fully discount illumination effects present in the
visible images, our experimental results show substantial improvements in
recognition performance overall, and it deserves further consideration.



1 Introduction 

Considerable progress has been made in face recognition research over the last
decade [1], especially with the development of powerful models of face appearance
(e.g., eigenspaces [2]). Despite the variety of approaches and tools studied,
however, face recognition has been shown to perform satisfactorily in controlled
environments, but it is not accurate or robust enough to be deployed in uncontrolled
environments. Several factors affect face recognition performance including pose
variation, facial expression changes, face occlusion, and most importantly,
illumination changes.

Previous studies have demonstrated that IR imagery offers a promising alternative
to visible imagery for handling variations in face appearance due to
illumination changes more successfully. In particular, IR imagery is nearly invariant
to changes in ambient illumination [3], and provides a capability for identification
under all lighting conditions including total darkness [4]. Thus, while
visible-based algorithms opt for pure algorithmic solutions to inherent phenomenology
problems, IR-based algorithms have the potential to offer simpler
and more robust solutions, improving performance in uncontrolled environments
and under deliberate attempts to obscure identity [5].

Despite its advantages, IR imagery has other limitations including that it 
is opaque to glass. Objects made of glass act as a temperature screen, com- 
pletely hiding the face parts located behind them. In this study, we examine the 
sensitivity of IR imagery to facial occlusion due to eyeglasses. To address this 
serious limitation of IR, we propose fusing IR with visible information in the 
wavelet domain using GAs [6]. To demonstrate the results of our fusion strat- 
egy, we performed extensive recognition experiments using the popular method 
of eigenfaces [2], although any other recognition method could have been used. 
Our results show overall substantial improvements in recognition performance
using IR and visible imagery fusion compared with either modality alone.

2 Review of Face Recognition in the Infrared Spectrum 

An overview of identification in the IR spectrum can be found in [7]. Below, we 
review several studies comparing the performance of visible and IR based face 
recognition. The effectiveness of visible versus IR was compared using several 
recognition algorithms in [8]. Using a database of 101 subjects without glasses, 
varying facial expression, and allowing minor lighting changes, they concluded 
that there are no significant performance differences between visible and IR 
recognition across all the algorithms tested. They also concluded that fusing 
visible and IR decision metrics represents a viable approach for enhancing face 
recognition performance. In [9,10], several different face recognition algorithms 
were tested under various lighting conditions and facial expressions. Using radiometrically
calibrated thermal imagery, they reported superior performance for
IR-based recognition compared with visible-based recognition. In [11], the effects of
lighting, facial expression, and passage of time between the gallery and probe images
were examined. Although IR-based recognition outperformed visible-based
recognition under lighting and facial expression changes, their experiments
demonstrated that IR-based recognition degrades when there is substantial passage
of time between the gallery and probe images. Using fusion strategies at the
decision level based on ranking and scoring, they were able to develop schemes 
that outperformed either modality alone. IR has also been used recently in face 
detection [12]. This approach employs multi-band feature extraction and capi- 
talizes on the unique reflectance characteristics of the human skin in the near-IR 
spectrum. 

3 Fusion of Infrared and Visible Imagery 

Despite its robustness to illumination changes, IR imagery has several draw- 
backs. First, it is sensitive to temperature changes in the surrounding environ- 
ment. Currents of cold or warm air could influence the performance of systems 
using IR imagery. As a result, IR images should be captured in a controlled 
environment. Second, it is sensitive to variations in the heat patterns of the face. 
Factors that could contribute to these variations include facial expressions (e.g.
open mouth), physical conditions (e.g. lack of sleep), and psychological condi- 
tions (e.g. fear, stress, excitement). Finally, IR is opaque to glass. As a result, a 
large part of the face might be occluded (e.g. by wearing eyeglasses). 

In contrast to IR imagery, visible imagery is more robust to the above factors. 
This suggests that effective algorithms to fuse information from both spectra 
have the potential to improve the state of the art in face recognition. In the 
past, fusion of visible and IR images has been successfully used for visualization 
purposes [13]. 

In this study, we consider the influence of eyeglasses on IR-based face recognition.
Our experiments demonstrate that eyeglasses pose a serious problem for
recognition performance in the IR spectrum. To remedy this problem, we propose
fusing IR with visible imagery. Visible imagery can suffer from highlights on
the glasses under certain illumination conditions, but the problems are considerably
less severe than with IR. Since IR and visible imagery capture intrinsically
different characteristics of the observed faces, intuitively, a better face description
could be found by utilizing the complementary information present in the
two spectra.

3.1 Fusion at Multiple Resolutions 

Pixel by pixel fusion does not preserve the spatial information in the image. In 
contrast, fusion at multiple resolution levels allows features with different spatial
extent to be fused at the resolution at which they are most salient. In this way,
important features appearing at lower resolutions can be preserved in the fusion 
process. 

Multiple resolution features have been used in several face recognition systems
in the past (e.g. [14]). The advantage of using different frequencies is that
high frequencies are relatively independent of global changes in the illumination, 
while the low frequencies take into account the spatial relationships among the 
pixels and are less sensitive to noise and small changes, such as facial expression. 

The slow heat transfer through the human body gives IR images of the human
face a naturally low resolution. Thus, we decided to implement our fusion strategy
in the wavelet domain, taking into consideration the benefits of multi-resolution 
representations and the differences in resolution between the IR and visible-light 
images. Our fusion strategy is thus different from fusion strategies implemented 
at the decision level, reported earlier in the literature (i.e., [8,11]). 

3.2 Method Overview 

The proposed method contains two major steps: (a) fusion of IR and visible 
images and (b) recognition based on the fused images. Fusion is performed by 
combining the coefficients of Haar wavelet [15] decompositions of a pair of IR and 
visible images having equal size. The fusion strategy is found during a training 
phase using GAs [6]. The coefficients selected from each spectrum are put to- 
gether and the fused face image is reconstructed using the inverse wavelet trans- 
form. To demonstrate the effectiveness of the fusion solutions found by GAs, 
we perform recognition using the popular method of eigenfaces [2] although any
other recognition technique could have been used. 

4 Mathematical Tools and Background Information 

4.1 Eigenfaces 

The eigenface approach uses Principal Components Analysis (PCA), a classical
multivariate statistics method, to linearly project face images in a low-dimensional
space. This space is spanned by the principal components (i.e.,
eigenvectors corresponding to the largest eigenvalues) of the distribution of the
training images. After a face image has been projected in the eigenspace, a
feature vector containing the coefficients of the projection is used to represent
the face image. Representing each image $I(x,y)$ as an $N \times N$ vector $\Gamma_i$, first
the average face $\Psi$ is computed: $\Psi = \frac{1}{R}\sum_{i=1}^{R} \Gamma_i$, where R is the number of
faces in the training set. Next, the difference $\Phi_i$ of each face from the average
face is computed: $\Phi_i = \Gamma_i - \Psi$. Then the covariance matrix is estimated by:
$C = \frac{1}{R}\sum_{i=1}^{R} \Phi_i \Phi_i^T = \frac{1}{R} A A^T$, where $A = [\Phi_1\ \Phi_2\ \cdots\ \Phi_R]$. The eigenspace can then
be defined by computing the eigenvectors $u_i$ of C. Usually, we need to keep
a smaller number of eigenvectors $R_k$ corresponding to the largest eigenvalues.
Each image $\Gamma$ is transformed by first subtracting the mean image ($\Phi = \Gamma - \Psi$),
and then projecting in the eigenspace: $w_i = u_i^T \Phi$.
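
The projection described above can be sketched with NumPy. The sketch is illustrative only: it obtains the eigenvectors of $A A^T$ from a singular value decomposition of $A$ rather than from the explicit covariance matrix, which is an implementation choice made here, and it uses synthetic random "faces". The names `fit_eigenfaces` and `project` are hypothetical.

```python
import numpy as np

def fit_eigenfaces(faces, n_components):
    """faces: (R, d) array with one flattened training image per row."""
    mean = faces.mean(axis=0)            # average face Psi
    A = (faces - mean).T                 # columns are Phi_i, shape (d, R)
    # The left singular vectors of A are the eigenvectors of A A^T,
    # ordered by decreasing eigenvalue.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return mean, U[:, :n_components]

def project(face, mean, U):
    """Feature vector w = U^T (Gamma - Psi)."""
    return U.T @ (face - mean)

rng = np.random.default_rng(0)
faces = rng.random((20, 64))             # 20 synthetic 8x8 images
mean, U = fit_eigenfaces(faces, n_components=5)
w = project(faces[0], mean, U)           # 5 projection coefficients
```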

4.2 Wavelet Transform (WT) 

Wavelets are a type of multi-resolution function approximation that allow for the
hierarchical decomposition of a signal or image. In particular, they decompose
a given signal onto a family of functions with finite support. This family of functions
is constructed by the translations and dilations of a single function called the
mother wavelet. The finite support of the mother wavelet gives exact time localization
while the scaling allows extraction of different frequency components. The
discrete wavelet transform (DWT) is defined in terms of discrete dilations and
translations of the mother wavelet function: $\psi_{jk}(t) = 2^{-j/2}\,\psi(2^{-j}t - k)$, where
the scaling factor j and the translation factor k are integers: $j, k \in \mathbb{Z}$. The wavelet
decomposition of a function $f(t) \in L^2(\mathbb{R})$ is given by: $f(t) = \sum_{j,k} b_{j,k}\,\psi_{jk}(t)$,
where the coefficients $b_{j,k}$ are the inner products of $f(t)$ and $\psi_{jk}(t)$.
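
For the Haar wavelet, one level of the 2D decomposition and its inverse can be written directly. The sketch below is a minimal illustration assuming an image with even height and width; the function names `haar2d` and `ihaar2d` and the sub-band labels are hypothetical, and labelling conventions for the detail bands vary between libraries.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar2d(img):
    """One level of the orthonormal 2D Haar transform (even-sized input)."""
    a = (img[0::2, :] + img[1::2, :]) / SQRT2   # averages of row pairs
    d = (img[0::2, :] - img[1::2, :]) / SQRT2   # differences of row pairs
    LL = (a[:, 0::2] + a[:, 1::2]) / SQRT2      # coarse approximation
    LH = (a[:, 0::2] - a[:, 1::2]) / SQRT2      # detail bands
    HL = (d[:, 0::2] + d[:, 1::2]) / SQRT2
    HH = (d[:, 0::2] - d[:, 1::2]) / SQRT2
    return LL, LH, HL, HH

def ihaar2d(LL, LH, HL, HH):
    """Inverse of haar2d: rebuild the image from its four sub-bands."""
    h, w = LL.shape
    a = np.empty((h, 2 * w))
    d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = (LL + LH) / SQRT2, (LL - LH) / SQRT2
    d[:, 0::2], d[:, 1::2] = (HL + HH) / SQRT2, (HL - HH) / SQRT2
    img = np.empty((2 * h, 2 * w))
    img[0::2, :], img[1::2, :] = (a + d) / SQRT2, (a - d) / SQRT2
    return img

rng = np.random.default_rng(1)
image = rng.random((8, 8))
bands = haar2d(image)
restored = ihaar2d(*bands)
```

Applying `haar2d` again to the `LL` band gives the next, coarser resolution level.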

4.3 Genetic Algorithms (GAs) 

GAs are a class of randomized, parallel search optimization procedures inspired 
by the mechanisms of natural selection, the process of evolution [6]. They were 
designed to efficiently search large, non-linear, poorly-understood search spaces. 
In the past, GAs have been used in target recognition [16], object recognition 
[17], face detection/verification [18,19], and feature selection [20,21].

GAs operate iteratively on a population of structures, each of which represents
a candidate solution to the problem, encoded as a string of symbols (i.e.,
chromosome). A randomly generated set of such strings forms the initial population
from which the GA starts its search. Three basic genetic operators guide this
search: selection, crossover and mutation. Evaluation of each string is based on
a fitness function which is problem-dependent. The fitness function determines
which of the candidate solutions are better. Selection probabilistically filters out
poor solutions and keeps high performance solutions for further investigation.
Mutation is a very low probability operator that plays the role of restoring lost
genetic material. Crossover in contrast is applied with high probability. It is a
randomized yet structured operator that allows information exchange between
the strings.

5 Evolutionary IR and Visible Image Fusion 

Our fusion strategy operates in the wavelet domain. The goal is to find an 
appropriate way to combine the wavelet coefficients from the IR and visible 
images. The key question is which wavelet coefficients to choose and how to 
combine them. Obviously, using un-weighted averages is not appropriate since it 
assumes that the two spectra are equally important and, even further, that they 
have the same resolution which is not true. Several experiments for fusing the 
wavelet coefficients of two images have been reported in [22]. Perhaps the most
intuitive approach is picking the coefficients with maximum absolute value [23]. 
The higher the absolute value of a coefficient is, the higher is the probability that 
it encodes salient image features. Our experiments using this approach showed 
poor performance. 

In this paper, we propose using GAs to fuse the wavelet coefficients from the 
two spectra. Our decision to use GAs for fusion was based on several factors. 
First, the search space for the image fusion task at hand is very large. In the 
past, GAs have demonstrated good performance when searching large solution 
spaces. Much work in the genetic and evolutionary computing communities has 
led to growing understanding of why they work well and plenty of empirical 
evidence to support this claim [24,25]. Second, the problem at hand appears 
to have many suboptimal solutions. Although GAs cannot guarantee finding a
global optimum, they have been shown to be successful in finding good local optima.
Third, they are suitable for parallelization and linear speedups are the norm, not
the exception [26]. Finally, we have applied GAs in the past for feature selection, 
a problem very much related to fusion, with good success [20,21]. 

Encoding: In our encoding scheme, the chromosome is a bit string whose 
length is determined by the number of wavelet coefficients in the image decom- 
position. Each bit in the chromosome is associated with a wavelet coefficient 
at a specific location. The value of a bit in this array determines whether the 
corresponding wavelet coefficient is selected from the IR (e.g., 0) or from the 
visible spectrum (e.g., 1). 
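
The encoding can be illustrated with a short sketch in which the chromosome acts as a selection mask over two coefficient arrays of equal shape. The function name `fuse_coefficients` and the toy values are hypothetical.

```python
import numpy as np

def fuse_coefficients(ir_coeffs, vis_coeffs, chromosome):
    """Pick each wavelet coefficient from IR (bit 0) or visible (bit 1)."""
    mask = np.asarray(chromosome, dtype=bool).reshape(ir_coeffs.shape)
    return np.where(mask, vis_coeffs, ir_coeffs)

ir = np.array([[1.0, 2.0], [3.0, 4.0]])
vis = np.array([[10.0, 20.0], [30.0, 40.0]])
fused = fuse_coefficients(ir, vis, [0, 1, 1, 0])
```

The fused image is then obtained by applying the inverse wavelet transform to the selected coefficients.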

Fitness Evaluation: Each individual in a generation represents a possible 
way to fuse IR with visible images. To evaluate its effectiveness, we perform 
the fusion based on the information encoded by this individual and apply the 
eigenface approach. Recognition accuracy is computed using a validation dataset
(see Section 7) and is used to provide a measure of fitness. 

Initial Population: In general, the initial population is generated randomly
(e.g., each bit in an individual is set by flipping a coin). In this way, however, we
will end up with a population where each individual contains the same number
of 1’s and 0’s on average. To explore subsets of different numbers of wavelet
coefficients chosen from each domain, the number of 1’s for each individual is
generated randomly. Then, the 1’s are randomly scattered in the chromosome.
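
A sketch of this initialisation is given below. It assumes that the number of 1’s is drawn uniformly between 0 and the chromosome length, which is one possible reading of "generated randomly"; the function name `init_population` is hypothetical.

```python
import numpy as np

def init_population(pop_size, n_bits, rng):
    """Create chromosomes whose number of 1's varies between individuals."""
    pop = np.zeros((pop_size, n_bits), dtype=np.uint8)
    for row in pop:  # each row is a view, so writes update `pop`
        n_ones = rng.integers(0, n_bits + 1)   # assumed uniform in [0, n_bits]
        positions = rng.choice(n_bits, size=n_ones, replace=False)
        row[positions] = 1                     # scatter the 1's at random
    return pop

rng = np.random.default_rng(42)
population = init_population(pop_size=10, n_bits=16, rng=rng)
ones_per_individual = population.sum(axis=1)
```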

Selection: Our selection strategy was cross generational. Assuming a population
of size N, the offspring double the size of the population and we select
the best N individuals from the combined parent-offspring population.
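
Cross-generational selection can be sketched as follows. The fitness function used in the example simply counts the 1’s in a chromosome and is a toy stand-in for recognition accuracy on the validation set; the names `select_next_generation` and `toy_fitness` are hypothetical.

```python
import numpy as np

def select_next_generation(parents, offspring, fitness_fn):
    """Keep the best len(parents) individuals of parents plus offspring."""
    combined = np.vstack([parents, offspring])
    scores = np.array([fitness_fn(ind) for ind in combined])
    best = np.argsort(-scores, kind="stable")[: len(parents)]
    return combined[best]

def toy_fitness(chromosome):
    return int(chromosome.sum())  # placeholder for validation accuracy

parents = np.array([[0, 0, 0], [1, 0, 0]], dtype=np.uint8)
offspring = np.array([[1, 1, 0], [1, 1, 1]], dtype=np.uint8)
survivors = select_next_generation(parents, offspring, toy_fitness)
```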

Crossover: In general, we do not know how different wavelet coefficients 
depend on each other. If dependent coefficients are far apart in the chromosome, 
it is more probable that traditional 1-point crossover will destroy the schemata.
To avoid this problem, uniform crossover is used here. The crossover probability 
used in our experiments was 0.96. 

Mutation: Mutation is a very low probability operator which flips the value
of a randomly chosen bit. The mutation probability used here was 0.02.
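
The two operators can be sketched as below, using the probabilities quoted above as defaults. Two details are assumptions made for illustration: in uniform crossover each position is swapped between the parents with probability 0.5, and the mutation probability is applied independently to every bit. The function names are hypothetical.

```python
import numpy as np

def uniform_crossover(a, b, rng, p_cross=0.96):
    """Swap each bit between the parents with probability 0.5 (assumed)."""
    if rng.random() >= p_cross:
        return a.copy(), b.copy()      # no crossover for this pair
    swap = rng.random(a.shape) < 0.5
    child1 = np.where(swap, b, a)
    child2 = np.where(swap, a, b)
    return child1, child2

def mutate(chromosome, rng, p_mut=0.02):
    """Flip each bit independently with probability p_mut (assumed per bit)."""
    flip = rng.random(chromosome.shape) < p_mut
    return np.where(flip, 1 - chromosome, chromosome)

rng = np.random.default_rng(7)
a = np.zeros(32, dtype=np.uint8)
b = np.ones(32, dtype=np.uint8)
c1, c2 = uniform_crossover(a, b, rng)
m = mutate(c1, rng)
```

Because the swap is decided position by position, the operator does not depend on how far apart related coefficients lie in the chromosome.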



6 Face Dataset 

In our experiments, we used the face database collected by Equinox Corporation 
under DARPA’s HumanID program [27]. Specifically, we used the long-wave 
infrared (LWIR) (i.e., 8-12 μm) and the corresponding visible spectrum images
from this database. The data was collected during a two-day period. Each pair
of LWIR and visible light images was taken simultaneously and co-registered
with 1/3 pixel accuracy (see Fig. 1). The LWIR images were radiometrically
calibrated and stored as grayscale images with 12 bits per pixel. The visible
images are also grayscale images represented with 8 bits per pixel. The size of 
the images in the database is 320x240 pixels. 

The database contains frontal faces under the following scenarios: (1) three
different light directions - frontal and lateral (right and left); (2) three facial
expressions - "frown", "surprise" and "smile"; (3) vocal pronunciation expressions
- subjects were asked to pronounce several vocals from which three representative
frames are chosen; and (4) presence of glasses - for subjects wearing glasses,
all of the above scenarios were repeated with and without glasses. Both IR and 
visible face images were preprocessed prior to experimentation by following a 
procedure similar to that described in [9,10]. The goal of preprocessing was to 
align and scale the faces, remove background, and account for some illumination 
variations (see Fig. 1). 

7 Experimental Procedure 

In this study, we attempted to test the effect on recognition performance of
each factor available in the Equinox database. In addition, we have performed
experiments focusing on the effect of eyeglasses. For comparison purposes, we
have attempted to evaluate our fusion strategy using a similar experimental
protocol to that given in [9,10]. Our evaluation methodology employs a training
set (i.e., used to compute the eigenfaces), a gallery set (i.e., set of persons enrolled
in the system), a validation set (i.e., used in the fitness evaluation of the GA),
and a test set (i.e., probe image set containing the images to be identified).
Our training set contains 200 images, randomly chosen from the entire Equinox
database.

Fig. 1. Examples of visible and IR image pairs and preprocessed images

For recognition, we used the Euclidean distance and the first 100 principal 
components as in [9,10]. Recognition performance was measured by finding the 
percentage of the images in the test set, for which the top match is an image of 
the same person from the gallery. To compensate for the relatively small number
of images in the database, the average error was recorded using a three-fold
cross-validation procedure. In particular, we split each dataset used for testing
randomly three times by keeping only 75% of the images for testing purposes
and the remaining 25% for validation purposes. To account for performance variations
due to random GA initialization, we averaged the results over three different 
GA runs for each test, choosing a different random seed each time. Thus, we 
performed a total of 9 runs for each gallery/test set experiment. 
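
The evaluation measure and the random split can be sketched as follows. The sketch assumes that feature vectors (eigenspace projections) and identity labels are already available as arrays; the function names and the toy data are hypothetical.

```python
import numpy as np

def rank1_accuracy(gallery_feats, gallery_ids, probe_feats, probe_ids):
    """Fraction of probes whose nearest gallery image (Euclidean distance)
    belongs to the same person."""
    diff = probe_feats[:, None, :] - gallery_feats[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    nearest = np.argmin(dist, axis=1)
    return float(np.mean(gallery_ids[nearest] == probe_ids))

def split_test_validation(n_images, rng, test_fraction=0.75):
    """Random split into test indices (75%) and validation indices (25%)."""
    perm = rng.permutation(n_images)
    n_test = int(round(test_fraction * n_images))
    return perm[:n_test], perm[n_test:]

gallery_feats = np.array([[0.0, 0.0], [10.0, 10.0]])
gallery_ids = np.array([1, 2])
probe_feats = np.array([[1.0, 0.0], [9.0, 9.0], [6.0, 6.0]])
probe_ids = np.array([1, 2, 1])
accuracy = rank1_accuracy(gallery_feats, gallery_ids, probe_feats, probe_ids)

rng = np.random.default_rng(0)
test_idx, val_idx = split_test_validation(8, rng)
n_runs = 3 * 3  # three random splits times three GA seeds
```

In the toy example the third probe lies closer to the wrong person, so two of the three probes are matched correctly.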

7.1 Facial Expression Tests 

The test sets for the facial expression experiments include the images containing 
the three expression frames and three vocal pronunciation frames. There are 90 
subjects with a total of 1266 pairs of images for the expression frames and 1299 
for the vocal frames. Some of the subjects in these test sets wear glasses while
others do not. Following the terminology in [9,10], we have created the following test
sets: EA (expression frames, all illuminations), EL (expression frames, lateral
illuminations), EF (expression frames, frontal illumination), VA (vocal frames, all
illuminations), VL (vocal frames, lateral illumination), VF (vocal frames, frontal
illumination). The inclusion relations among these sets are as follows: EA = EL
∪ EF, VA = VL ∪ VF, and VA ∩ EA = ∅.

7.2 Eyeglasses Tests 

Measuring the effect of eyeglasses is done by using the expression frames. There 
are 43 subjects wearing glasses in the EA set making a total of 822 images. Following
the terminology in [9,10], we created the following test sets: EG (expression
frames with glasses, all illuminations), EnG (expression frames without glasses,
all illuminations), EFG (expression frames with glasses, frontal illumination),
ELG (expression frames with glasses, lateral illumination), EFnG (expression
frames without glasses, frontal illumination), ELnG (expression frames without
glasses, lateral illumination). The inclusion relations among these sets are as
follows: EG = ELG ∪ EFG, EnG = ELnG ∪ EFnG, and EG ∩ EnG = ∅.

8 Experimental Results 

8.1 Eyeglasses 

The results shown in Table 1 illustrate that IR-based recognition is robust to 
illumination changes but performs poorly when glasses are present in the gallery 
set but not in the test set and vice versa. Considerable improvements in recognition
performance have been achieved in this case by fusing IR with visible
images. The improvement was even greater when, in addition to eyeglasses, the 
test and the gallery set contained images taken under different illuminations. For 
example, in the EFG/ELnG test case the fusion approach improved recognition 
performance by 46% compared to recognition using visible-light images and by 
82% compared to recognition using LWIR images. 

Recognition using LWIR images outperformed recognition using fused images 
when the only difference between the images in the test and gallery sets was the 
direction of illumination. We attribute this to the inability of our fusion scheme 
to fully discount illumination effects contributed by the visible-light images. 
Recognition performance using visible-light images was always worse than using 
fused images. 

8.2 Facial Expression 

The facial expression tests had varying success as shown in Table 2. In general, 
fusion led to improved recognition compared to recognition using visible-light 
images. In several cases, however, the accuracy using LWIR images was higher 
than using fused images. These were again cases where the illumination 
directions of the gallery and the test sets were different. This result is consistent 
with that of the eyeglasses tests and was caused by the inability of our fusion 
scheme to fully discount the illumination effects in the visible images. Note that 
we did not perform experiments in which the gallery and test sets had a 
non-empty intersection. 

9 Discussion 

The presence or absence of eyeglasses proved to be a major obstacle for IR-based 
recognition. To better understand this, let us take a closer look at the results 
shown in Table 1. The horizontal and vertical double lines through the center 
of the table divide the table into four quadrants (i.e., I to IV, starting from 
Table 1. Averages and standard deviations for the eyeglasses experiments. The 
columns represent the gallery set and the rows represent the test set. The first en- 
try in each cell shows the performance measured from the visible-light images, the 
second entry is from the LWIR images, and the third entry is from the fused images. 
The bottom entry shows the minimum and maximum recognition performances from 
the three cross-validation runs achieved when using the fused images. Test scenarios 
for which the test and the gallery sets had common subsets were not performed. 

Fig. 2. The average performance values from Table 1, visualized as a grayscale image. 
See text for details. (a) Ideal case, (b) visible images, (c) IR images, (d) fused images. 



the upper-right corner and moving counterclockwise). Each quadrant represents 
a set of experiments testing some specific difference between the gallery and 
the test sets: (1) Experiments in quadrant I evaluate the effect of eyeglasses 
being present in the probe but not in the gallery; (2) Experiments in quadrant 
III evaluate the effect of eyeglasses being present in the gallery but not in 
the probe; (3) Experiments along the off-diagonals within each of these two 
quadrants represent tests where the illumination conditions between the gallery 
and probe sets are the same; (4) Experiments in quadrants II and IV evaluate 
the effect of illumination changes only. 

To illustrate the performance of our fusion approach, we have interpolated 
the results from Table 1 and used a simple visualization scheme to remove small 
differences and emphasize major trends in recognition performance (see Fig. 2). 
Our visualization scheme assigns a grayscale value to each average from Table 
1, with black implying 0% recognition and white 100% recognition. The empty 
cells from Table 1 are also shown in black. 

By observing Fig. 2, several interesting conclusions can be drawn. As expected, 
face recognition success based on IR images (see Fig. 2.(b)) is not influenced by 
lighting conditions. This is supported by the prevailing white color in quadrants 
II and IV (case (3)) and by the high recognition rates in quadrants II and IV 
(case (4)). However, IR yielded very low success when eyeglasses were present in 
the gallery but not in the probe and vice-versa (cases (1) and (2)). The success 
of visible-based face recognition was relatively insensitive to subjects’ wearing 
glasses (see Fig. 2.(c)). This follows from the relatively uniform color in quadrants 
I and III (cases (1) and (2)). Lighting conditions had a big influence on the success 
of face recognition in the visible domain. There are distinguishable bright lines 
along the main diagonals in quadrants I and II (case (3)). The success of face 
recognition based on fused images was similar in all four quadrants of the image 
(see Fig. 2.(d)). This implies that we were able to achieve relative insensitivity 
to both eyeglasses and variable illumination. 

The image fusion approach led to higher recognition performance compared 
to recognition in the visible spectrum but was not able to completely compensate 
for the effects of illumination direction in the visible images. We have noticed 
that in all the cases where LWIR performed better than fusion, the illumination 
direction in the gallery set was different from that in the test set (assuming no 
difference in glasses). The presence of illumination effects in the fused images 

Fig. 3. The original (a) visible and (b) IR images followed by two different fused image 
results - (c) trained on a data set with lateral illumination and without glasses and 
(d) trained on a data set with glasses. 




Fig. 4. The first few eigenfaces of a fused image data set. The second and third eigen- 
faces show clear influence of the right and left lateral illumination. 



can be visually confirmed by observing the reconstructed fused images shown in 
Fig. 3 and their first eigenfaces shown in Fig. 4. Fused images had higher 
resolution than LWIR images; however, they were also affected by the 
illumination effects present in the visible images. The first eigenfaces of 
the fused images clearly still encode the effects of illumination direction present 
in the visible images. More effective fusion schemes (e.g., weighted averages of 
wavelet coefficients) and more powerful fitness functions (e.g., with extra terms to 
control the number of coefficients selected from different bands of each spectrum) 
might help to overcome these problems and improve fusion overall. 

Also, further consideration should be given to the existence of many opti- 
mal solutions found by the GA. Although optimal in the training phase, these 
solutions showed different recognition performances when used for testing. In 
investigating these solutions, we were not able to distinguish any pattern in the 
content of the chromosomes that might have revealed why some chromosomes 
were better than others. On average, half of the coefficients were selected 
from the visible spectrum and the other half from the IR spectrum. The use of 
larger validation sets and more selective fitness functions might help to address 
these issues more effectively. 
10 Conclusions and Future Work 

We presented a fusion method for combining IR and visible-light images for the 
purpose of face recognition. The algorithm aims at improved and robust 
recognition performance across variable lighting, facial expression, and presence of 
eyeglasses. Future work includes addressing the issues mentioned in the previous 
section, considering fitness approximation schemes [28] to reduce the 
computational requirements of fitness evaluation, and investigating the effect of 
environmental (e.g., temperature changes), physical (e.g., lack of sleep), and 
physiological conditions (e.g., fear, stress) on IR performance. 
Abstract. Reliable detection of fiducial targets in real-world images is 
addressed in this paper. We show that even the best existing schemes are 
fragile when exposed to other than laboratory imaging conditions, and 
introduce an approach which delivers significant improvements in relia- 
bility at moderate computational cost. The key to these improvements 
is in the use of machine learning techniques, which have recently shown 
impressive results for the general object detection problem, for example 
in face detection. Although fiducial detection is an apparently simple 
special case, this paper shows why robustness to lighting, scale and fore- 
shortening can be addressed within the machine learning framework with 
greater reliability than previous, more ad-hoc, fiducial detection schemes. 



1 Introduction 

Fiducial detection is an important problem in real-world vision systems. The 
task of identifying the position of a pre-defined target within a scene is central 
to augmented reality and many image registration tasks. It requires fast, accu- 
rate registration of unique landmarks under widely varying scene and lighting 
conditions. Numerous systems have been proposed which deal with various as- 
pects of this task, but a system with reliable performance on a variety of scenes 
has not yet been reported. 

Figure 1 illustrates the difficulties inherent in a real-world solution of this 
problem, including background clutter, motion blur [1], large differences in scale, 
foreshortening, and the significant lighting changes between indoors and out. 
These difficulties mean that a reliable general-purpose solution calls for a new 
approach. In fact, the paper shows how the power of machine learning techniques, 
for example as applied to the difficult problem of generic face detection [2], can 
benefit even the most basic of computer vision tasks. 

One of the main challenges in fiducial detection is handling variations in scene 
lighting. Transitions from outdoors to indoors, backlit objects and in-camera 
lighting all cause global thresholding algorithms to fail, so present systems tend 
to use some sort of adaptive binarization to segment the features. 

The problem addressed in this paper is to design a planar pattern which can 
be reliably detected in real world scenes. We first describe the problem, then 
cover existing solutions and present a new approach. We conclude by comparing 
the learning-based and traditional approaches. 
Fig. 1. Sample frames from test sequences. The task is to reliably detect the targets 
(four disks on a white background) which are visible in each image. It is a claim of this 
paper that, despite the apparent simplicity of this task, no technique currently in use is 
robust over a large range of scales, lighting and scene clutter. In real-world sequences, 
it is sometimes difficult even for humans to identify the target. We wish to detect the 
target with high reliability in such images. 
Fig. 2. Overall algorithm to locate fiducials. (a) Input image, (b) output from the fast 
classifier stage, (c) output from the full classifier superimposed on the original image. 
Every pixel has now been labelled as fiducial or non-fiducial. The size of the circles 
indicates the scale at which that fiducial was detected. (d) The target verification step 
rejects non-target fiducials through photometric and geometric checks. (e) Fiducial 
coordinates computed to subpixel accuracy. 
2 Previous Work 

Detection of known points within an image can be broken down into two phases: 
design of the fiducials, and the algorithm to detect them under scene variations. 
The many proposed fiducial designs include: active LEDs [3,4]; black and white 
concentric circles [5]; coloured concentric circles [6,7]; one-dimensional line 
patterns [8]; squares containing either two-dimensional bar codes [9], more general 
characters [10] or a Discrete Cosine Transform [11]; and circular ring-codes 
[12,13]. The accuracy of using circular fiducials is discussed in [14]. 
Three-dimensional fiducials whose images directly encode the pose of the viewer have 
been proposed by [15]. We have selected a circular fiducial as the centroid is easily and 
efficiently measured to sub-pixel accuracy. Four known points are required to 
compute camera pose (a common use for fiducial detection) so we arrange four 
circles in a square pattern to form a target. The centre of the target may contain 
a barcode or other marker to allow different targets to be distinguished. 

Naimark and Foxlin [13] identify non-uniform lighting conditions as a major 
obstacle to optical fiducial detection. They implement a modified form of homo- 
morphic image processing in order to handle the widely varying contrast found 
in real-world images. This system is effective in low-light, in-camera lighting, and 
also strong side-lighting. Once a set of four ring-code fiducials have been located 
the system switches to tracking mode and only checks small windows around the 
known fiducials. The fiducial locations are predicted based on an inertial motion 
tracker. 

TRIP [12] is a vision-only system that uses adaptive thresholding [16] to 
binarize the image, and then detects the concentric circle ring-codes by ellipse 
fitting. Although the entire frame is scanned on start-up and at specified in- 
tervals, an ellipse tracking algorithm is used on intermediate frames to achieve 
real-time performance. The target image can be detected 99% of the time up to 
a distance of 3 m and angle of 70 degrees from the target normal. 

CyberCode [9] is an optical object tagging system that uses two-dimensional 
bar codes to identify objects. The bar codes are located by a second moments 
search for guide bars amongst the regions of an adaptively thresholded [16] image. 
The lighting needs to be carefully controlled and the fiducial must occupy a 
significant portion of the video frame. 

The AR Toolkit [10] contains a widely used fiducial detection system that 
tracks square borders surrounding unique characters. An input frame is 
thresholded and then each square is searched for a pre-defined identification pattern. The 
global threshold constrains the allowable lighting conditions, and the operating 
range has been measured at 3m for a 20x20 cm target [17]. 

Cho and Neumann [7] employ multi-scale concentric circles to increase their 
operating range. A set of 10 cm diameter coloured rings, arranged in a square 
target pattern similar to that used in this paper, can be detected up to 4.7m 
from the camera. 

Motion blur causes pure vision tracking algorithms to fail as the fiducials are 
no longer visible. Our learnt classifier can accommodate some degree of motion 
blur through the inclusion of relevant training data. 
These existing systems all rely on transformations to produce invariance 
to some of the properties of real world scenes. However, lighting variation, 
scale changes and motion blur still affect performance. Rather than image pre- 
processing, we deal with these effects through machine learning. 

2.1 Detection versus Tracking 

In our system there is no prediction of the fiducial locations; the entire frame 
is processed every time. One way to increase the speed of fiducial detection is 
to only search the region located in the previous frame. This assumes that the 
target will only move a small amount between frames and causes the probability 
of tracking subsequent frames to depend on success in the current frame. As a 
result, the probability of successfully tracking through to the end of a sequence is 
the product of the frame probabilities, and rapidly falls below the usable range. 
An inertial measurement unit can provide a motion prediction [1], but there is 
still the risk that the target will fall outside the predicted region. This work 
will focus on the problem of detecting the target independently in each frame, 
without prior knowledge from the earlier frames. 
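
As a rough illustration of how quickly this product falls, the sketch below assumes that every frame is tracked independently with the same success probability. The per-frame rates and sequence lengths are hypothetical values chosen for illustration; they are not measurements from this work.

```python
def sequence_success_probability(per_frame_success: float, num_frames: int) -> float:
    """Probability of tracking through every frame of a sequence.

    Assumes (for illustration only) that frames succeed independently with
    the same probability, so the sequence probability is a simple product.
    """
    if not 0.0 <= per_frame_success <= 1.0:
        raise ValueError("per_frame_success must lie in [0, 1]")
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    return per_frame_success ** num_frames


if __name__ == "__main__":
    # Hypothetical per-frame success rates and sequence lengths.
    for p in (0.99, 0.999):
        for n in (100, 1000):
            prob = sequence_success_probability(p, n)
            print(f"per-frame {p:.3f}, {n:4d} frames -> {prob:.4f}")
```

Under these assumptions, even a tracker that succeeds on 99% of frames is more likely than not to lose the target somewhere within a 100-frame sequence, which is the motivation for detecting the target afresh in each frame.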

3 Strategy 

The fiducial detection strategy adopted in this paper is to collect a set of sample 
fiducial images under varying conditions, train a classifier on that set, and then 
classify a subwindow surrounding each pixel of every frame as either fiducial or 
not. There are a number of challenges, not least of which are speed and reliability. 

We begin by collecting representative training samples in the form of 12x12 
pixel images; larger fiducials are scaled down to fit. This training set is then used 
to classify subwindows as outlined in Figure 2. The classifier must be fast and 
reliable enough to perform half a million classifications per frame (one for the 
12x12 subwindow at each location and scale) and still permit recognition of the 
target within the positive responses. 

High efficiency is achieved through the use of a cascade of classifiers [2]. The 
first stage is a fast “ideal Bayes” lookup that compares the intensities of a pair of 
pixels directly with the distribution of positive and negative sample intensities 
for the same pair. If that stage returns positive then a more discriminating (and 
expensive) tuned nearest neighbour classifier is used. This yields the probabil- 
ity that a fiducial is present at every location within the frame; non-maxima 
suppression is used to isolate the peaks for subsequent verification. 

The target verification is also done in two stages. The first checks that the 
background between fiducials is uniform and that the separating distance falls 
within the range for the scale at which the fiducials were identified. The second 
step is to check that the geometry is consistent with the corners of a square under 
perspective transformation. The final task is to compute the weighted centroid 
of each fiducial within the found target and report the coordinates. 

The following section elaborates on this strategy; first we discuss the selection 
of training data, then each stage of the classification cascade is covered in detail. 
Fig. 3. Representative samples of positive target images. Note the wide variety of 
positive images that are all examples of a black dot on a white background. 



3.1 Training Data 

A subset of the positive training images is shown in Figure 3. These were ac- 
quired from a series of training videos using a simple tracking algorithm that was 
manually reset on failure. These samples indicate the large variations that occur 
in real-world scenes. The window size was set at 12x12 pixels, which limited 
the sample dot size to between 4 and 9 pixels in diameter; larger dots are scaled 
down by a factor of two until they fall within the specification. Samples were 
rotated and lightened or darkened to artificially increase the variation in the 
training set. This proves to be a more effective means of incorporating rotation 
and lighting invariance than ad hoc intensity normalization, as discussed in §5. 

3.2 Cascading Classifier 

The target location problem here is firmly cast as one of statistical pattern clas- 
sification. The criteria for choosing a classifier are speed and reliability: the four 
subsampled scales of a 720x576 pixel video frame contain 522,216 subwindows 
requiring classification. Similar to [2], we have adopted a system of two cascading 
probes: 

— fast Bayes decision rule classification on sets of two pixels from every window 
in the frame 

— slower, more specific nearest neighbour classifier on the subset passed by the 
first stage 

The first stage of the cascade must run very efficiently, have a near-zero false 
negative rate (so that any true positives are not rejected prematurely) and pass 
a minimal number of false positives. The second stage provides very high classi- 
fication accuracy, but may incur a higher computational cost. 
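
The subwindow count quoted above can be reproduced with a short calculation. The sketch below assumes that each of the four scales is obtained by halving the previous one and that a W x H image contributes (W - 12)(H - 12) window positions; both conventions are our assumptions, chosen because together they give the quoted total.

```python
def count_subwindows(width: int, height: int, window: int = 12, num_scales: int = 4) -> int:
    """Count classification subwindows over a pyramid of halved images.

    Assumed convention: a W x H image contributes (W - window) * (H - window)
    positions, and each successive scale halves both dimensions.
    """
    total = 0
    for _ in range(num_scales):
        total += (width - window) * (height - window)
        width //= 2
        height //= 2
    return total


if __name__ == "__main__":
    # 720x576 frame: 399312 + 96048 + 22176 + 4680 window positions.
    print(count_subwindows(720, 576))
```

For a 720x576 frame this gives 522,216 subwindows, so even a small false positive rate in the first stage leaves many candidates for the second stage.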



3.3 Cascade Stage One: Ideal Bayes 

The first stage of the cascade constructs an ideal Bayes decision rule from the 
positive and negative training data distributions. These were measured from 
the training data and additional positive and negative images taken from the 
training videos. The sampling procedure selects two pixels from each subwindow: 
one at the centre of the dot and the other on the background. The distribution 
of the training data is shown in Figure 4. 

The two distributions can be combined to yield a Bayes decision surface. If $g_p$ 
and $g_n$ represent the positive and negative distributions then the classification 

Fig. 4. Distribution of (a) negative pairs and (b) positive pairs used to construct 
the fast classifier. (c) ROC curve used to determine the value of $\alpha$ (indicated by the 
dashed line) which will produce the optimal decision surface given the costs of positive 
and negative errors. (d) The selected Bayes decision surface. 



of a given intensity pair $x$ is: 

$$\mathrm{classification}(x) = \begin{cases} +1 & \text{if } \alpha\, g_p(x) > g_n(x) \\ -1 & \text{otherwise} \end{cases} \qquad (1)$$

where $\alpha$ is the relative cost of a false negative over a false positive. The parameter 
$\alpha$ was varied to produce the ROC curve shown in Figure 4c. A weighting of 
$\alpha$ = produces the decision boundary shown in Figure 4d, and corresponds 
to a sensitivity of 0.9965 and a specificity of 0.75. 
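
A minimal sketch of such a lookup-table decision rule is given below, assuming the two distributions are approximated by coarse two-dimensional histograms over (centre, edge) intensity pairs. The function names, the 16-bin quantization, and the toy data are illustrative assumptions rather than details of our implementation.

```python
from collections import Counter


def build_histogram(pairs, bins=16):
    """Normalized 2-D histogram of (centre, edge) intensity pairs in 0..255."""
    step = 256 // bins
    counts = Counter((c // step, e // step) for c, e in pairs)
    total = len(pairs)
    return {cell: n / total for cell, n in counts.items()}


def build_decision_table(pos_pairs, neg_pairs, alpha, bins=16):
    """Precompute +1/-1 for every quantized pair: +1 where alpha*g_p > g_n."""
    g_p = build_histogram(pos_pairs, bins)
    g_n = build_histogram(neg_pairs, bins)
    table = [[-1] * bins for _ in range(bins)]
    for c in range(bins):
        for e in range(bins):
            if alpha * g_p.get((c, e), 0.0) > g_n.get((c, e), 0.0):
                table[c][e] = 1
    return table


def classify_pair(table, centre, edge, bins=16):
    """Classify one intensity pair with a single table lookup."""
    step = 256 // bins
    return table[centre // step][edge // step]


if __name__ == "__main__":
    # Toy data (hypothetical): dark centres on light backgrounds are positive.
    positives = [(20, 200), (30, 210), (25, 190)]
    negatives = [(200, 200), (20, 30), (100, 100), (25, 195)]
    table = build_decision_table(positives, negatives, alpha=1.0)
    print(classify_pair(table, 20, 200), classify_pair(table, 200, 200))
```

Because the table is computed once from the training data, classifying a pair at run time costs only one lookup, and raising alpha enlarges the positive region at the price of more false positives.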

A subwindow is marked as a possible fiducial if a series of intensity pairs all 
lie within the positive decision region. Each pair contains the central point and 
one of seven outer pixels. The outer edge pixels were selected to minimize the 
number of false positives based on the above empirical distributions. 

The first stage of the cascade seeks dark points surrounded by lighter 
backgrounds, and thus functions like a well-trained edge detector. Note however 
that the decision criterion is not simply (edge − center) > threshold as would be 
the case if the center was merely required to be darker than the outer edge. 
Instead, the decision surface in Figure 4d encodes the fact that {dark center, dark 
edge} are more likely to be background, and {light center, light edge} are rare in 
the positive examples. Even at this early edge detection stage there are benefits 
from including learning in the algorithm. 

3.4 Cascade Stage Two: Nearest Neighbour 

Among the various methods of supervised statistical pattern recognition, the 
nearest neighbour rule [18] achieves consistently high performance [19]. The 
strategy is very simple: given a training set of examples from each class, a new 
sample is assigned the class of the nearest training example. In contrast with 
many other classifiers, this makes no a priori assumptions about the distribu- 
tions from which the training examples are drawn, other than the notion that 
nearby points will tend to be of the same class. 

For a binary classification problem we are given sets of positive and negative 
examples $\{p_i\}$ and $\{n_j\}$, subsets of $\mathbb{R}^d$, where $d$ is the dimensionality of the input 
vectors (144 for the image windows tested here). The NN classifier is then 
formally written as 

$$\mathrm{classification}(x) = -\operatorname{sign}\left(\min_i \|p_i - x\|^2 - \min_j \|n_j - x\|^2\right). \qquad (2)$$

This is extended in the k-NN classifier, which reduces the effects of noisy training 
data by taking the k nearest points and assigning the class of the majority. The 
choice of k should be performed through cross-validation, though it is common 
to select k small and odd to break ties (typically 1, 3 or 5). 
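
The nearest neighbour rule and its k-NN extension can be sketched in a few lines. The code below is an illustrative brute-force version using squared Euclidean distance; the function names and the choice to label exact ties as negative are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nn_classify(x, positives, negatives):
    """1-NN rule: +1 if the nearest positive is closer than the nearest negative.

    Exact ties are labelled -1 here; this tie-break is an assumption.
    """
    d_pos = min(squared_distance(p, x) for p in positives)
    d_neg = min(squared_distance(n, x) for n in negatives)
    return 1 if d_pos < d_neg else -1


def knn_classify(x, positives, negatives, k=3):
    """k-NN rule: majority vote among the k nearest training examples."""
    labelled = [(squared_distance(p, x), 1) for p in positives]
    labelled += [(squared_distance(n, x), -1) for n in negatives]
    labelled.sort(key=lambda item: item[0])
    vote = sum(label for _, label in labelled[:k])
    return 1 if vote > 0 else -1


if __name__ == "__main__":
    positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    negatives = [(5.0, 5.0), (6.0, 5.0), (0.4, 0.4)]  # last point is "noisy"
    query = (0.5, 0.5)
    print(nn_classify(query, positives, negatives))      # swayed by the noisy point
    print(knn_classify(query, positives, negatives, 3))  # majority vote
```

In this toy example the single noisy negative decides the 1-NN result, while the 3-NN vote is dominated by the surrounding positives. Every query is compared with every training example, which is the cost addressed in the remainder of this section.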

One of the chief drawbacks of the nearest neighbour classifier is that it is 
slow to execute. Testing an unknown sample requires computing the distance 
to each point in the training data; as the training set gets large this can be a 
very time consuming operation. A second disadvantage derives from one of the 
technique’s advantages: that a priori knowledge cannot be included where it is 
available. We address both of these in this paper. 



Speeding Up Nearest Neighbour. There are many techniques available for 
improving the performance and speed of a nearest neighbour classification [20]. 
One approach is to pre-sort the training sets in some way (such as kd-trees [21] 
or Voronoi cells [22]); however, these become less effective as the dimensionality 
of the data increases. Another solution is to choose a subset of the training 
data such that classification by the 1-NN rule (using the subset) approximates 
the Bayes error rate [19]. This can result in significant speed improvements 
as k can now be limited to 1 and redundant data points have been removed 
from the training set. These data modification techniques can also improve the 
performance through removing points that cause mis-classifications. 

We examined two of the many techniques for obtaining a training subset: 
condensed nearest neighbour [23] and edited nearest neighbour [24]. The con- 
densed nearest neighbour algorithm is a simple pruning technique that begins 
with one example in the subset and recursively adds any examples that the sub- 
set misclassifies. Drawbacks to this technique include sensitivity to noise and no 
guarantee of the minimum consistent training set because the initial few 
patterns have a disproportionate effect on the outcome. Edited nearest neighbour 
is a reduction technique that removes an example if all of its neighbours are of a 
single class. This acts as a filter to remove isolated or noisy points and smooth 
the decision boundaries. Isolated points are generally considered to be noisy; 
however if no a priori knowledge of the data is assumed then the concept of 
noise is ill-defined and these points are equally likely to be valid. In our tests it 
was found that attempts to remove noisy points decreased the performance. 

The condensing algorithm was used to reduce the size of the training data 
sets as it was desirable to retain “noisy” points. Manual selection of an initial 
sample was found to increase the generalization performance. The combined 
(test and training) data was condensed from 8506 positive and 19,052 negative 
examples to 37 positive and 345 negative examples. 
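
The condensing procedure can be sketched as follows. This is a generic illustration of the condensed nearest neighbour idea on a toy one-dimensional data set, with a caller-chosen seed standing in for the manually selected initial sample; the names and data are hypothetical and the sketch is not our exact implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(x, store):
    """Label of the stored example closest to x (1-NN on the subset)."""
    return min(store, key=lambda item: squared_distance(item[0], x))[1]


def condense(samples, seed_index=0):
    """Condensed nearest neighbour on a list of (vector, label) pairs.

    Starts from one seed example and repeatedly adds every example that the
    current subset misclassifies, until a full pass adds nothing.
    """
    store = [samples[seed_index]]
    in_store = {seed_index}
    changed = True
    while changed:
        changed = False
        for i, (x, label) in enumerate(samples):
            if i in in_store:
                continue
            if nearest_label(x, store) != label:
                store.append((x, label))
                in_store.add(i)
                changed = True
    return store


if __name__ == "__main__":
    samples = [((0.0,), 1), ((0.1,), 1), ((0.2,), 1),
               ((5.0,), -1), ((5.1,), -1), ((5.2,), -1)]
    subset = condense(samples, seed_index=0)
    print(len(subset), "of", len(samples), "examples retained")
```

The retained subset classifies every original training example correctly, but its content depends on the seed and on the order in which examples are visited, which is why the choice of initial sample matters.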
Parameterization of Nearest Neighbour. Another enhancement to the 
nearest neighbour classifier involves favouring specific training data points 
through weighting [25]. In cases where the cost of a false positive is greater 
than the cost of a false negative it is desirable to weight all negative training 
data so that negative classification is favoured. This cost parameter allows a 
ROC curve to be constructed, which is used to tune the detector based on the 
relative costs of false positive and negative classifications. 

We define the likelihood ratio to be the ratio of distances to the nearest 
negative and positive training examples: 

likelihood-ratio = nearest-negative / nearest-positive. 

In the vicinity of a target dot there will be a number of responses where this 
ratio is high. Rather than returning all pixel locations above a certain threshold 
we locally suppress all non-maxima and return the point of maximum likeli- 
hood (similar to the technique used in Harris corner detection [26]; see [27] for 
additional details). 
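
The ratio and the suppression step can be illustrated with a small sketch. The code below assumes the ratios have already been arranged in a 2-D response map; the small epsilon guarding against division by zero, the 3x3 neighbourhood, and the rule that equal neighbours suppress each other are all assumptions made for the example.

```python
def likelihood_ratio(nearest_negative, nearest_positive, eps=1e-12):
    """Distance to nearest negative over distance to nearest positive.

    eps avoids division by zero when a window coincides with a training
    positive; it is an implementation convenience, not part of the definition.
    """
    return nearest_negative / (nearest_positive + eps)


def local_maxima(response, threshold, radius=1):
    """Return (row, col) of responses >= threshold that strictly exceed
    every other value within the surrounding (2*radius+1)^2 neighbourhood."""
    rows, cols = len(response), len(response[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = response[r][c]
            if value < threshold:
                continue
            is_peak = True
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    if (rr, cc) != (r, c) and response[rr][cc] >= value:
                        is_peak = False
            if is_peak:
                peaks.append((r, c))
    return peaks


if __name__ == "__main__":
    response = [[1, 1, 1, 1],
                [1, 3, 5, 1],
                [1, 2, 1, 1],
                [1, 1, 1, 4]]
    print(local_maxima(response, threshold=2))
```

In the toy map the responses 3 and 2 exceed the threshold but are suppressed by the neighbouring 5, so each cluster of high ratios yields a single candidate location.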

4 Implementation 

Implementation of the cascading classifier described in the previous section is 
straightforward; this section describes the target verification step. Figure 2c 
shows a typical example of the classifier output, where the true positive re- 
sponses are accompanied by a small number of false positives. Verification is 
merely used to identify the target amongst the positive classification responses; 
we outline one approach but there are any number of suitable techniques. 

First we compute the Delaunay triangulation of all points to identify the lines 
connecting each positive classification with its neighbours. A weighted average 
adaptive thresholding of the pixels along each line identifies those with dark 
ends and light midsections. All other lines are removed; points that retain two 
or more connecting lines are passed to a geometric check. This check takes sets of 
four points, computes the transformation to map three of them onto the corners 
of a unit right triangle, and then applies that transformation to the remaining 
point. If the mapped point is close enough to the fourth corner of a unit square, 
we retrieve the original grayscale image for each fiducial and return the set of 
weighted centroid target coordinates. 
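
The geometric check can be sketched as below. The code assumes the four points are ordered so that the first is the corner adjacent to the second and third, with the fourth diagonally opposite; this ordering, the tolerance value, and the function names are hypothetical choices for illustration.

```python
import math


def map_to_unit_triangle(p0, p1, p2, q):
    """Apply to q the affine map sending p0, p1, p2 to (0,0), (1,0), (0,1).

    Solves q - p0 = u*(p1 - p0) + v*(p2 - p0) for (u, v) by Cramer's rule.
    """
    ax, ay = p1[0] - p0[0], p1[1] - p0[1]
    bx, by = p2[0] - p0[0], p2[1] - p0[1]
    det = ax * by - ay * bx
    if abs(det) < 1e-12:
        raise ValueError("first three points are collinear")
    qx, qy = q[0] - p0[0], q[1] - p0[1]
    u = (qx * by - qy * bx) / det
    v = (ax * qy - ay * qx) / det
    return u, v


def looks_like_square(p0, p1, p2, p3, tolerance=0.25):
    """True if the mapped fourth point lies near (1, 1), the remaining corner
    of the unit square. The tolerance is an assumed value that leaves room
    for moderate perspective distortion."""
    u, v = map_to_unit_triangle(p0, p1, p2, p3)
    return math.hypot(u - 1.0, v - 1.0) <= tolerance


if __name__ == "__main__":
    print(looks_like_square((10, 10), (30, 10), (10, 30), (30, 30)))  # exact square
    print(looks_like_square((10, 10), (30, 10), (10, 30), (28, 29)))  # mild distortion
    print(looks_like_square((10, 10), (30, 10), (10, 30), (50, 12)))  # not a target
```

An affine map is only an approximation to the perspective case, which is why the fourth point is accepted within a tolerance rather than required to land exactly on the corner.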

5 Discussion 

The intention of this work was to produce a fiducial detector which offered 
extremely high reliability in real-world problems. To evaluate this algorithm, a 
number of video sequences were captured with a DV camcorder and manually 
marked up to provide ground truth data. The sequences were chosen to include 
the high variability of input data under which the algorithm is expected to be 
used. It is important also to compare performance to a traditional “engineered” 
detector, and one such was implemented as described in the appendix. 

Fig. 5. (a) ROC curve for the overall fiducial detector. The vertical axis displays the 
percentage of ground truth targets that were detected. (b) An enlarged view of the 
portion of (a) corresponding to the typical operating range of the learnt detector. The 
drop in detection rate at 0.955 is an artefact of the target verification stage whereby 
some valid targets are rejected due to encroaching false positive fiducials. 



Table 1. Success rate of target verification with various detectors. Normalizing each 
window prior to classification improves the success rate on some frames, but more false 
positive frames are introduced and the overall performance is worse. The engineered 
detector cannot achieve the same level of reliability as the learnt detector. 



                     Learnt detector    Normalized detector    Engineered detector 
Sequence   Targets   True     False     True      False        True      False 
Church     300       98.3%    2.0%      99.3%     8.7%         46.3%     0.0% 
Lamp       200       95.5%    0.5%      99.5%     3.0%         61.5%     0.0% 
Lounge     400       98.8%    0.5%      96.5%     1.0%         96.5%     0.0% 
Bar        975       89.3%    0.0%      91.3%     0.0%         65.7%     0.5% 
Multiple   2100      95.2%    0.7%      93.5%     3.5%         83.0%     1.4% 
Library    325       99.1%    0.6%      94.2%     0.0%         89.8%     5.2% 
Summary    4300      94.7%    0.4%      94.0%     2.3%         77.3%     1.1% 

The fiducial detection system was tested on six video sequences containing 
indoor/outdoor lighting, motion blur and oblique camera angles. The reader is 
encouraged to view the video of detection results available from [28]. 

Ground truth target coordinates were manually recorded for each frame and 
compared with the results of three different detection systems: learnt classi- 
fier, learnt classifier with subwindow normalization, and the engineered detector 
described in the appendix. Table 1 lists the detection and false positive rates 
for each sequence, while Table 2 lists the average number of positives found 
per frame. Overall, the fast classification stage returned just 0.33% of the 
subwindows as positive, allowing the classification system to process the average 
720x576 frame at four scales in 120 ms. The operating range is up to 10 m with 
a 50 mm lens and angles up to 75 degrees from the target normal. 

Normalizing each subwindow and classifying with normalized data was shown 
to increase the number of positives found. The target verification stage must then 
examine a larger number of features; since this portion of the system is currently 
implemented in Matlab alone, it causes the entire algorithm to run slower. This is 
in addition to the increased complexity of computing the normalization of each 
subwindow prior to classification. By contrast, appending a normalized copy of the 
training data to the training set was found to increase the range of classification 
without significantly affecting the number of false positives or processing time. 
The success rate on the dimly lit bar sequence was increased from below 50% to 
89.3% by including training samples normalized to approximate dim lighting. 
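The training-set augmentation described above can be sketched as follows. This is only an illustration: the source does not specify which normalization was used, so a per-sample zero-mean, unit-variance scaling is assumed here, and the function names are hypothetical.

```python
import numpy as np

def normalize_rows(samples, eps=1e-8):
    """Scale each sample (one row per subwindow) to zero mean and unit variance.

    Assumption: the paper's normalization is not given; this is a common choice.
    """
    mean = samples.mean(axis=1, keepdims=True)
    std = samples.std(axis=1, keepdims=True)
    return (samples - mean) / (std + eps)

def augment_with_normalized(train_x, train_y):
    """Append a normalized copy of every training sample, keeping its label."""
    aug_x = np.vstack([train_x, normalize_rows(train_x)])
    aug_y = np.concatenate([train_y, train_y])
    return aug_x, aug_y
```

Because the extra work happens once at training time, the per-subwindow cost at detection time is unchanged, which matches the observation that processing time was not significantly affected.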

Careful quantitative experiments comparing this system with the AR Toolkit
(an example of a developed method of fiducial detection) have not yet been
completed; however, a qualitative analysis of several sequences containing both
targets under a variety of scene conditions has been performed. Although the
AR Toolkit performs well in the office, it fails under motion blur and when
in-camera lighting disrupts the binarization. The template matching to identify a
specific target does not incorporate any colour or intensity normalization and is
therefore very sensitive to lighting changes. We deal with all of these variations
through the inclusion of relevant training samples.

This paper has presented a fiducial detector which has superior performance 
to reported detectors. This is because of the use of machine learning. This detec- 
tor demonstrated 95% overall performance through indoor and outdoor scenes 
including multiple scales, background clutter and motion blur. A cascade of 
classifiers permits high accuracy at low computational cost. 
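A two-stage cascade of this kind might look roughly like the sketch below. The second stage is a nearest-neighbour classifier, as the text indicates; the first stage shown here (a simple contrast threshold) is a hypothetical stand-in, since the actual fast classifier is not described in this section.

```python
import numpy as np

def fast_stage(windows, threshold):
    """Cheap first stage (hypothetical): keep windows with enough contrast."""
    contrast = windows.std(axis=1)
    return np.flatnonzero(contrast > threshold)

def full_stage(windows, train_x, train_y):
    """Costly second stage: label each window by its nearest training sample."""
    # Squared Euclidean distance from every window to every training sample.
    d2 = ((windows[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[d2.argmin(axis=1)]

def cascade(windows, train_x, train_y, threshold=0.1):
    """Return indices of windows accepted by both stages."""
    candidates = fast_stage(windows, threshold)
    if candidates.size == 0:
        return candidates
    labels = full_stage(windows[candidates], train_x, train_y)
    return candidates[labels == 1]
```

The expensive distance computation only runs on the small fraction of subwindows that survive the first stage, which is where the computational saving comes from.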

The primary conclusion of the paper is the observation that even “simple”
vision tasks become challenging when high reliability under a wide range of
operating conditions is required. Although a well engineered ad hoc detector
can be tuned to handle a wide range of conditions, each new application and
environment requires that the system be more or less re-engineered. In contrast,
with appropriate strategies for managing training set size, a detector based on
learning can be retrained for new environments without significant architectural
changes.

Table 2. Average number of positive fiducial classifications per frame. The full classifier
is only applied to the positive results of the fast classifier. This cascade allows the learnt
detector to run faster and return fewer false positives than the engineered detector.

Sequence   True        Fast         Full         Normalized        Engineered
           positives   classifier   classifier   full classifier   detector
Church     4           5790         107          135               121
Lamp       4           560          23           30                220
Lounge     4           709          36           55                43
Bar        4           82           5            6                 205
Multiple   1-3†        2327         79           107               96
Library    4           1297         34           49                82
Average    -           1794         47           64

† The Multiple sequence contains between 1 and 3 targets per frame.

Further work will examine additional methods for reducing the computational
load of the second classifier stage. This could include Locality Sensitive
Hashing as a fast approximation to the nearest neighbour search, or a different
classifier altogether such as a support vector machine.

Appendix: Engineered Detector

An important question for this work is how well it compares with traditional
ad hoc approaches to fiducial detection. In this section we outline a local
implementation of such a system.

Each frame is converted to grayscale, binarized using adaptive thresholding as
described in [16], and connected components used to identify continuous regions.
The regions are split into scale bins based on area, and under or over-sized regions
removed. Regions are then rejected if the ratio of the convex hull area and actual
area is too low (region not entirely filled or boundary is not continually convex),
or if they are too eccentric (if the axes ratio of an ellipse with the same second
moments is too high).
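The area and eccentricity tests above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the thresholds are hypothetical parameters, the convex hull test is omitted for brevity, and each region is assumed to arrive as a boolean mask.

```python
import numpy as np

def axes_ratio(mask):
    """Major/minor axis ratio of the ellipse with the region's second moments."""
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys]).astype(float)
    cov = np.cov(coords)               # 2x2 matrix of second central moments
    evals = np.linalg.eigvalsh(cov)    # eigenvalues in ascending order
    return float(np.sqrt(evals[1] / max(evals[0], 1e-12)))

def keep_region(mask, min_area, max_area, max_ratio):
    """Reject regions that are too small, too large, or too eccentric."""
    area = int(mask.sum())
    if area < 2 or area < min_area or area > max_area:
        return False
    return axes_ratio(mask) <= max_ratio
```

A filled disc has an axes ratio close to 1 and passes, while a long thin bar has a large ratio and is rejected.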
Abstract. Statistical shape and texture appearance models are powerful
image representations, but previously had been restricted to 2D or 3D
shapes with smooth surfaces and lambertian reflectance. In this paper we
present a novel 3D appearance model using image-based rendering
techniques, which can represent complex lighting conditions, structures, and
surfaces. We construct a light field manifold capturing the multi-view
appearance of an object class and extend the direct search algorithm of
Cootes and Taylor to match new light fields or 2D images of an object
to a point on this manifold. When matching to a 2D image the
reconstructed light field can be used to render unseen views of the object. Our
technique differs from previous view-based active appearance models in
that model coefficients between views are explicitly linked, and that we
do not model any pose variation within the shape model at a single view.
It overcomes the limitations of polygonal based appearance models and
uses light fields that are acquired in real-time.



1 Introduction 

Appearance models are a natural and powerful way of describing objects of the 
same class. Multidimensional morphable models [13], active appearance mod- 
els [6], and their extensions have been applied to model a wide range of ob- 
ject appearance. The majority of these approaches represent objects in 2D and 
model view change by morphing between the different views of an object. Mod- 
elling a wide range of viewpoints in a single 2D appearance model is possible, 
but requires non-linear search [19]. Additionally, object self-occlusion introduces 
holes and folds in the synthesized target view which are difficult to overcome. 
Large pose variation is easily modelled using 3D; a polygonal 3D appearance 
model was proposed by Blanz and Vetter [3]. With their approach the view is 
an external parameter of the model and does not need to be modelled as shape 
variation. However, this technique is based on a textured polygonal mesh which
has difficulty representing fine structure, complex lighting conditions and
non-lambertian surfaces. Due to the accuracy of the 3D surfaces needed with their
approach, the face scans of each prototype subject cannot be captured in real- 
time and fine structure such as hair cannot be acquired. 

In this paper we propose a 3D active appearance model using image-based 



T. Pajdla and J. Matas (Eds.): ECCV 2004, LNCS 3024, pp. 481-493, 2004. 
© Springer-Verlag Berlin Heidelberg 2004




482 C.M. Christoudias, L.-P. Morency, and T. Darrell 




Fig. 1. (a) A light field appearance manifold $L_{model}$. Each point on the manifold is a
4D light field representing the 3D shape and surface reflectance of an object. The light
field of an object is constructed by computing its projection onto the shape-texture
appearance manifold. A 2D input image is matched to a point on this manifold by
interpolating the shape and texture of neighboring prototype light fields. (b) A light
field can capture non-lambertian effects (e.g. glasses).



rendering [14,11] rather than rendering with a polygonal mesh. We use a light 
field representation, which does not require any depth information to render 
novel views of the scene. With light field rendering, each model prototype con- 
sists of a set of sample views of the plenoptic function [1]. Shape is defined for 
each prototype and a combined texture-shape PCA space computed. The result- 
ing appearance manifold (see Figure 1(a)) can be matched to a light field or 2D 
image of a novel object by searching over the combined texture-shape parame- 
ters on the manifold. We extend the direct search matching algorithm of [6] to 
light fields. Specifically, we construct a Jacobian matrix consisting of intensity 
gradient light fields. A 2D image is matched by rendering the Jacobian at the 
estimated object pose. Our approach can easily model complex scenes, lighting 
effects, and can be captured in real-time using camera arrays [23,22]. 

2 Previous Work 

Statistical models based on linear manifolds of shape and/or texture variation 
have been widely applied to the modelling, tracking, and recognition of objects [2, 
8,13,17]. In these methods small amounts of pose change are typically modeled 
implicitly as part of shape variation on the linear manifold. For representing 
objects with large amounts of rotation, nonlinear models have been proposed, but 
are complex to optimize [19] . An alternative approach to capturing pose variation 




Light Field Appearance Manifolds 483 





(a) 



(b) 



Fig. 2. (a) Light field camera array [23]. (b) A 6x8 light field of the average head. The 
light field prototypes were acquired using the 6 top rows of the camera array due to 
field of view constraints. 



is to use an explicit multi-view representation which builds a PCA model at 
several viewpoints. This approach has been used for pure intensity models [16] 
as well as shape and texture models [7]. A model of inter-view variation can be
recovered using the approach in [7], and missing views could be reconstructed. 
However, in this approach pose change is encoded as shape variation, in contrast 
to 3D approaches where pose is an external parameter. Additionally, views were 
relatively sparse, and individual features were not matched across views. 

Shape models with 3D features have the advantage that viewpoint change can 
be explicitly optimized while matching or rendering the model. Blanz and Vetter 
[3] showed how a morphable model could be created from 3D range scans of hu- 
man heads. This approach represented objects as simply textured 3D shapes, and 
relied on high-resolution range scanners to construct a model; non-lambertian 
and dynamic effects are difficult to capture using this framework. With some 
manual intervention, 3D models can be learned directly from monocular video 
[9,18]; an automatic method for computing a 3D morphable model from video 
was shown in [4]. These methods all used textured polygonal mesh models for 
representing and rendering shape. 

Multi-view 2D [7] and textured polygonal 3D [3,9,18] appearance models cannot
model objects with complex surface reflectance. Image-based models have become
popular in computer graphics recently and can capture these phenomena;
with an image-based model, 3D object appearance is captured in a set of sampled
views or ray bundles. Light field [14] and lumigraph [11] rendering techniques
create new images by resampling the set of stored rays that represent an object.
Most recently the unstructured lumigraph [5] was proposed, and generalized the
light field/lumigraph representation to handle arbitrary camera placement and
geometric proxies.

Recently, Gross et al. [12] have proposed eigen light fields, a PCA-based
appearance model built using light fields. They extend the approach of Turk and
Pentland [21] to light fields and define a robust pose-invariant face recognition
algorithm using the resulting model. A method to morph two light fields was
presented in [24]; this algorithm extended the classic Beier and Neely algorithm
to work directly on the sampled light field representation and to account for
self-occlusion across views. Features were manually defined, and only a morph
between two (synthetically rendered) light fields was shown in their work.

In this paper we develop the concept of a light field active appearance model, 
in which 3 or more light fields are “vectorized” (in the sense of [2]) and placed 
in correspondence. We construct a light field morphable model of facial appear- 
ance from real images, and show how that model can be automatically matched 
to single static intensity images with non-lambertian effects (e.g. glasses). Our 
model differs from the multi-view appearance model of [7] in that we build 
a 4D representation of appearance with light fields. With our method, model 
coefficients between views are explicitly linked and we do not model any pose 
variation within the shape model at a single view. We are therefore able to model 
self-occlusion and complex lighting effects better than a multi-view AAM. We 
support this claim in our experimental results section. 



3 Light Field Shape and Texture 

In this section we provide a formal description of the shape and texture of a 
set of light field prototypes that define the appearance manifold of an object 
class. Let $L(u, v, s, t)$ be a light field consisting of a set of sample views of the
scene, parameterized by view indices $(u, v)$ and scene radiance indices $(s, t)$,
and let $L_1, \ldots, L_n$ be a set of prototype light fields with shape $X_1, \ldots, X_n$. In
general, for any image-based rendering technique, $X_i$ is a set of 3D feature points
which outline the shape of the imaged object. With a light field, no 3D shape
information is needed to render a novel view of the object. It is therefore sufficient
to represent the shape of each light field as the set of 2D feature points, which
are the projections of the 3D features into each view. More formally, we define
the shape, $X$, of a light field $L$ as

$$X = \{\, x_{(u,v)} \mid (u, v) \in L \,\} \tag{1}$$

where $x_{(u,v)}$ is the shape in a view $(u, v)$ of $L$. If the camera array is strongly
calibrated it is sufficient to find correspondences in two views and re-project to the
remaining views. With only weak calibration and the assumption of a densely
sampled array, feature points may be specified in select views of the light field
and tracked into all other views.

Once shape is defined for each prototype light field, Procrustes analysis [10]
is performed to place the shape of each object into a common coordinate frame.
Effectively, Procrustes analysis applies a rigid body transformation to the shape
of each light field such that each object is aligned to the same 3D pose. From
the set of normalized shapes $X_i$ of each prototype, the reference shape $X_{ref}$ is
computed as

$$X_{ref} = M_\alpha \bar{X} \tag{2}$$

where $\bar{X}$ is the mean shape of the aligned shapes and $M_\alpha$ is a matrix which scales
and translates the mean shape such that it is expressed in pixel coordinates (i.e.
with respect to the height and width of each discrete view of a light field). The
matrix $M_\alpha$ constrains the shape in each view of the reference light field to be
within the height and width of the view.
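As a rough illustration of the alignment step, the sketch below aligns one set of 2D feature points to a reference with a rotation and translation, using the standard SVD solution of the orthogonal Procrustes problem. It is a simplified single-view stand-in under stated assumptions: the paper aligns the shape of a whole light field, and the function name is hypothetical.

```python
import numpy as np

def align_rigid(shape, reference):
    """Rotate and translate `shape` (N x 2) to best match `reference` (N x 2)."""
    mu_s = shape.mean(axis=0)
    mu_r = reference.mean(axis=0)
    s0 = shape - mu_s
    r0 = reference - mu_r
    # Orthogonal Procrustes: minimise ||s0 @ rot - r0|| over rotations.
    u, _, vt = np.linalg.svd(s0.T @ r0)
    d = np.sign(np.linalg.det(u @ vt))      # guard against reflections
    rot = u @ np.diag([1.0, d]) @ vt
    return s0 @ rot + mu_r
```

Repeating this alignment against the current mean and recomputing the mean until it stops changing gives the usual iterative form of Procrustes analysis.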

As in [2], the texture of a prototype light field is its “shape free” equivalent.
It is found by warping each light field to the reference shape $X_{ref}$. As will be
shown in the next section, this allows for the definition of a texture vector space
that is decoupled from shape variation. Specifically, the texture of a light field
$L$ is defined as

$$G(u, v, s, t) = L(D(u, v, s, t)) = L \circ D(u, v, s, t) \tag{3}$$

where $D$ is the mapping

$$D : \mathbb{R}^4 \rightarrow \mathbb{R}^4 \tag{4}$$

that specifies for each ray in $L$ a corresponding ray in the reference light field
$L_{ref}$ and is computed using the shape of $L$ and $X_{ref}$. Equation (3) may be
thought of as a light field warping operation, a concept introduced by Zhang et
al. [24]. As in [6], the texture of each prototype, $G_i$, is normalized to be under
the same global illumination.

4 Light Field Appearance Manifolds 

As illustrated in the previous section, once a reference is defined, each prototype
light field may be described in terms of its shape and texture. The linear
combination of texture and shape forms an appearance manifold: given a set of light
fields of the same object class, the linear combination of their texture warped
by a linear combination of their shape describes a new object whose shape and
texture are spanned by that of the prototype light fields. Compact and efficient
linear models of shape and texture variation may be obtained using PCA, as
shown in [6]. Given the set of prototype light fields $L_1, \ldots, L_n$, each having shape
$X_i$ and texture $G_i$, PCA is applied independently to the normalized shape and
texture vectors, $X_i$ and $G_i$, to give

$$X = \bar{X} + P_s b_s, \qquad G = \bar{G} + P_g b_g \tag{5}$$

Using Equation (5), the shape and texture of each model light field is described
by its corresponding shape and texture parameters $b_s$ and $b_g$. As there may
exist a correlation between texture and shape, a more compact model of shape
and texture variation is obtained by performing a PCA on the concatenated
shape and texture parameter vectors of each prototype light field. This results
in a combined texture-shape PCA space:

$$X = \bar{X} + Q_s c, \qquad G = \bar{G} + Q_g c \tag{6}$$



where, as in [6],

$$Q_s = P_s W_s^{-1} P_{cs}, \qquad Q_g = P_g P_{cg} \tag{7}$$

and $W_s$ is a matrix which commensurates the variation in shape and texture
when performing the combined texture-shape PCA. In our experiments we use
$W_s = rI$, where $r$ is a scalar computed from $\sigma_s^2$ and $\sigma_g^2$. Here $\sigma_s^2$ and $\sigma_g^2$
represent the total variance of
the normalized shape and texture. Equation (6) maps each model light field to
a vector $c$ in the combined texture-shape PCA space. To generalize the model
to allow for arbitrary 3D pose and global illumination, Equation (6) may be
re-defined as follows,

$$X_m = S_t(\bar{X} + Q_s c), \qquad G_m = T_u(\bar{G} + Q_g c) \tag{8}$$

where $S_t$ is a function that applies a rigid body transformation to the model
shape according to a pose parameter vector $t$, $T_u$ is a function which scales
and shifts the model texture using an illumination parameter vector $u$, and the
parameter vectors $t$ and $u$ are as defined in [6]. Note, the reference light field has
parameters $c = 0$, $t = \alpha$ and $u = 0$, where $\alpha$ is a pose vector that is equivalent
to the matrix $M_\alpha$ in Equation (2).
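The construction of the combined texture-shape space can be sketched as below, following the usual active appearance model recipe of [6]. This is a minimal sketch under stated assumptions: shapes and textures are given as row vectors (one prototype per row), the weight r is passed in as a plain parameter rather than derived from the variances, and all function names are hypothetical.

```python
import numpy as np

def pca(data):
    """Return (mean, basis); basis columns are principal directions."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt.T

def build_combined_model(shapes, textures, r):
    """Combined model: X = x_mean + Q_s c and G = g_mean + Q_g c."""
    x_mean, p_s = pca(shapes)
    g_mean, p_g = pca(textures)
    b_s = (shapes - x_mean) @ p_s        # shape parameters per prototype
    b_g = (textures - g_mean) @ p_g      # texture parameters per prototype
    b = np.hstack([r * b_s, b_g])        # weight shape with W_s = r I
    _, p_c = pca(b)                      # second PCA on concatenated parameters
    k = p_s.shape[1]
    q_s = p_s @ (p_c[:k] / r)            # Q_s = P_s W_s^{-1} P_cs
    q_g = p_g @ p_c[k:]                  # Q_g = P_g P_cg
    c = b @ p_c                          # combined parameters per prototype
    return x_mean, g_mean, q_s, q_g, c
```

With all components retained, each prototype is reproduced exactly from its vector c; discarding low-variance columns of the combined basis yields the compact model.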

The light field appearance manifold is defined as,

$$L_{model} = G_m \circ D_m \tag{9}$$

where $L_{model}$ is a model light field that maps to a point on the appearance
manifold and $D_m$ is a 4D deformation field which maps each ray in the reference
light field to a ray in the model light field and is computed using the shape
of the model light field, $X_m$, and the shape of the reference light field, $X_{ref}$.
Note, Equation (9) suggests that an optical flow technique may also be used to 
represent shape as in [13] to build a light field active appearance model. We have 
implemented both approaches, and below report results using the feature-based 
shape representation of Section 3. 



5 Model Matching 

In this section, we show how to generalize the matching technique of [6] to light 
fields. We first illustrate how to match a light field and then discuss the more 
interesting task of fitting a model light field to a single 2D image.

Matching to a Light Field. A novel light field, $L_s$, is matched to a point $c$ on
the texture-shape appearance manifold by minimizing the following non-linear
objective function:

$$E(p) = |G_m - G_s|^2 \tag{10}$$

where $p^T = (c^T \mid t^T \mid u^T)$ are the parameters of the model, $G_m$ is the model
texture and $G_s$ is the normalized texture of $L_s$ assuming it has shape $X_m$. $G_s$
is computed by warping $L_s$ from $X_m$ to the reference shape $X_{ref}$. The model
shape and texture are computed at $p$ using Equation (8).

The direct search gradient descent algorithm of [6] is easily extendible to a
light field active appearance model. In [6] a linear relationship for the change in
image intensity with respect to the change in model parameters was derived via
a first order Taylor expansion of the residual function $r(p) = G_m - G_s = \delta g$.
In particular, given a point $p$ on the manifold, the parameter gradient that
minimizes the objective function (10) was computed as $\delta p = -R\,\delta g$, where the
matrix $R$ is the pseudo-inverse of the Jacobian, $J = \partial r / \partial p$, derived from the Taylor
expansion of the residual function.

In a 2D active appearance model the columns of the Jacobian are intensity 
gradient images which model how image intensity changes with respect to each 
model parameter and vice versa. Analogously, the Jacobian of a light field active 
appearance model represents the change in light field intensity with respect to 
the change in model parameters, each of its columns representing light field intensity
gradients that describe the intensity change across all the views of a light field. 
Consequently, the algorithm for minimizing Equation (10) follows directly from 
[6]. As in a 2D AAM, the Jacobian is learned via numerical differentiation. 
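The update rule and the numerically estimated Jacobian can be illustrated with a toy residual. This sketch assumes a generic residual function in place of the texture difference, uses central differences at a single point (the paper perturbs the training prototypes instead), and adds step halving as a simple safeguard; the function names are hypothetical.

```python
import numpy as np

def numeric_jacobian(residual, p, deltas):
    """Estimate J = dr/dp by central differences, one perturbation per parameter."""
    cols = []
    for i, h in enumerate(deltas):
        step = np.zeros_like(p)
        step[i] = h
        cols.append((residual(p + step) - residual(p - step)) / (2.0 * h))
    return np.stack(cols, axis=1)

def direct_search(residual, p0, R, max_iter=30, tol=1e-10):
    """Iterate p <- p + k * dp with dp = -R r(p), accepting only improving steps."""
    p = p0.copy()
    err = float(np.sum(residual(p) ** 2))
    for _ in range(max_iter):
        dp = -R @ residual(p)
        improved = False
        for k in (1.0, 0.5, 0.25):
            cand = p + k * dp
            cand_err = float(np.sum(residual(cand) ** 2))
            if cand_err < err:
                p, err, improved = cand, cand_err, True
                break
        if not improved or err < tol:
            break
    return p, err
```

Here R would be computed once, offline, as `np.linalg.pinv(J)`, so each iteration costs only one residual evaluation and one matrix-vector product.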

Matching to an Image. A more interesting extension of the AAM framework
arises when performing direct search to match a light field AAM to a single
2D image; with a light field the Jacobian matrix is rendered based on pose. A
novel image $I_s$ is matched to a point on the light field appearance manifold by
minimizing the objective

$$E(p, \epsilon) = |F(G_m, \epsilon) - g_s|^2 \tag{11}$$

where $\epsilon$ is the camera pose of $I_s$, $F$ is a function that renders the pose $\epsilon$ of the
model texture [14,5] and $g_s$ is the texture of $I_s$ assuming it has shape $x_m$. $g_s$ is
computed by warping $I_s$ from $x_m$ to the reference shape $x_{ref}$. Both 2D shapes
are obtained by rendering $X_m$ and $X_{ref}$ into view $\epsilon$ using

$$x = F_x(X, \epsilon) \tag{12}$$

where $F_x$ is a variant of the light field rendering function $F$; it renders shape in
view $\epsilon$ via a linear interpolation of the 2D shape features defined in each view
of $X$.

Overall, the objective function in Equation (11) compares the novel 2D image
to the corresponding view in $L_{model}$. Minimizing this objective function fits a
model light field, $L_{model}$, that best approximates $I_s$ in view $\epsilon$. An efficient way to
optimize Equation (11) is by defining a two step iteration process, in which the
pose $\epsilon$ is optimized independently of the model parameters $p$. The pose $\epsilon$ may
be computed via an exhaustive search of the average light field, $L_{ref}$, in which
cross-correlation is used to initialize $\epsilon$ to a nearby discrete view of the model light
field. The pose parameter $t$ is used to further refine this pose estimate during
matching.
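The pose initialization step can be sketched as an exhaustive search over the discrete views of the average light field. This is a simplified illustration: it assumes the input image is already cropped to the same size as each view, it compares whole images with normalized cross-correlation, and the function names and array layout (rows x columns x height x width) are assumptions.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized images."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float((a * b).mean())

def initial_view(image, mean_light_field):
    """Return the (row, column) index of the view most correlated with `image`."""
    scores = np.array([[ncc(image, view) for view in row]
                       for row in mean_light_field])
    return np.unravel_index(scores.argmax(), scores.shape)
```

The returned index only seeds the pose; refinement is left to the pose parameters during matching, as the text describes.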

Once $\epsilon$ is approximated, direct search may be employed to match $I_s$ to a point
on the texture-shape appearance manifold. As previously discussed, each column
of the Jacobian, $J$, of a light field active appearance model is a light field intensity
gradient. To approximate the intensity gradient in view $\epsilon$ of the target image
$I_s$, light field rendering is applied to each column of $J$. This yields a “rendered”
Jacobian matrix, $J_\epsilon$, specified as

$$J_\epsilon^i = F(J^i, \epsilon), \quad i = 1, \ldots, m \tag{13}$$

where $J^i$ represents column $i$ of the matrix $J$ and $m$ is the number of columns in
$J$. Note that, similar to the model and image textures of Equation (10), the columns
of $J_\epsilon$ have shape $x_{ref}$ defined above.

Using $J_\epsilon$, optimizing Equation (11) is analogous to matching $I_s$ to a 2D AAM.
Thus, as in Equation (10), the direct search gradient descent algorithm of [6] is
used to minimize Equation (11), with one exception. In [6] the normalized mean
of the texture vectors is used to project $g_s$ into the same global illumination of
the model texture. With a light field AAM the normalized mean texture is a
light field, and thus cannot be directly applied to normalize $g_s$ in Equation (11).
Instead, we normalize both $g_m = F(G_m, \epsilon)$ and $g_s$ to have zero mean and unit
variance. We found this normalization scheme to work well in our experiments.

6 Experiments 

We built a light field morphable model of the human head by capturing light 
fields of 50 subjects using a real-time light field camera array [23]. We collected 
48 views (6 x 8) of each individual and manually segmented the head from each 
light field. Our head database consists of 37 males and 13 females of various 
races. Of these people, 7 are bearded and 17 are wearing glasses. The images in 
each view of the prototype light fields have resolution 320 x 240. Within each 
image, the head spans a region of approximately 80 x 120 pixels. The field of 
view captured by the camera array is approximately 25 degrees horizontally and 
20 degrees vertically. To perform feature tracking, as described in Section 3, we
used a multi-resolution Lucas-Kanade optical flow algorithm [15], with 4 pyramid
levels and Laplacian smoothing.¹ For comparison, we built a view-based AAM
using the views of the light field camera array [7]. In both the definition of the
view-based and light field active appearance models the parameter perturbations
displayed in Table 1 were used to numerically compute the Jacobian matrix. To
avoid over-fitting to noise, texture-shape PCA vectors having low variance were
discarded from each model, the remaining PCA vectors modelling 90% of the
total model variance.

¹ We acknowledge Tony Ezzat for the Lucas-Kanade optical flow implementation.

Table 1. Perturbation scheme used in both the view-based and light field AAMs. [20]

Variables   Perturbations
x, y        ±5% and ±10% of the height and width of the reference shape
θ           ±5, ±15 degrees
scale       ±5%, ±15%
c_1..k      ±0.25, ±0.5 standard deviations
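The variance-based truncation of the PCA basis described above can be sketched as follows. The helper name is hypothetical, and the eigenvalues are assumed to be the per-component variances returned by the PCA.

```python
import numpy as np

def retained_components(eigenvalues, fraction=0.90):
    """Smallest number of leading components reaching `fraction` of total variance."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(ev) / ev.sum()
    # Index of the first cumulative value >= fraction, converted to a count.
    return int(np.searchsorted(cumulative, fraction) + 1)
```

For example, with variances 6, 2.5, 1 and 0.5 the first two components cover 85% of the total, so three components are kept to reach 90%.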

We implemented the view-based and light field active appearance models in 
MATLAB. To perform light field rendering we use the unstructured lumigraph 
algorithm described in [5]. In our experiments, our matching algorithm typically 
converged between 4 and 15 iterations when matching to an image and between 
4 and 10 iterations when matching to a light field. Each iteration took a few 
seconds in un-optimized MATLAB. We believe that using a real-time light field 
renderer [5] would result in matching times similar to those reported for a 2D 
AAM [20]. 

7 Results 

In this section we provide a comparison between a light field and a 2D view-based 
active appearance model. We then present various model matching experiments 
using our head light field appearance manifold. 

Comparison to a View-Based AAM. To compare our method to a view-based
AAM we built a single-view 2D AAM and compared it against a light field
AAM. Each model was constructed using all fifty subjects, and was matched to
a side view of two people. The resulting fits are displayed in Figure 3. In this
figure one person is wearing glasses which self-occlude the subject in extreme
views of the camera array. These self-occlusions are difficult to model using a
view-based AAM, where inter-pose variation is modelled as shape. Also note
that the view-dependent texturing effects in the person's glasses are preserved
by the light field AAM, but are lost by the view-based AAM even though the
person remains in the model.

Model Matching. To demonstrate the ability to fit a light field AAM to a 
single 2D image or light field, we match a novel person to the constructed head 
manifold using “leave-one-out” experimentation. Figure 4 illustrates fitting light 
fields of two people taken out of the model. To conserve space, only select views 
of each light field are displayed. Both fits are shown superimposed onto the 
corresponding input light field. Each light field is also provided for ground truth 
comparison. As seen from the figure, the input light fields are well matched and 
a convincing reconstruction of each person is generated. Specifically, the shape 
and texture of both individuals is well captured across views.



Fig. 3. Comparison of a light field active appearance model to a view-based AAM. The 
left column shows the input, the middle column the best fit with a 2D AAM, and the 
right column the light field fit. The 2D and light field appearance models both exhibit 
qualitatively good fits when the surface is approximately smooth and lambertian. When 
glasses are present, however, the 2D method fails and the light field appearance model 
succeeds. 




[Panel labels: Ground Truth, Fit, Ground Truth, Fit]

Fig. 4. Matching a light field AAM to a light field of a novel subject. 



Figure 5 illustrates our model’s ability to generate convincing light field re- 
constructions from 2D images. This figure provides two example matches to 2D 
images with known pose. For each match, the person was removed from the 
model and imaged at a randomly selected pose not present in the light field 
AAM. The fit, rendered at the selected pose of each person, is displayed below 
each input image. The fitted light fields are also displayed. Note our method 
built a light field with 48 views from a single 2D image.



Fig. 5. Matching a light field AAM to 2D images of novel subjects. Each person is
matched at a known pose. The reconstructed light field is rendered over the input
view and is displayed aside each match. The light field appearance model generates
convincing light field reconstructions from 2D images. In particular, the overall shape
and texture of each subject are well approximated across each view.






[Panel labels: Frontal Pose, light field, Ground Truth]




Fig. 6. Matching a light field AAM using automatic pose estimation (side pose). A
match to a frontal, known pose is also provided for comparison. Note the reconstructed
light fields are the same for both poses. Ground truth is shown on the right for
comparison.



Figure 6 displays a fit to the head model using an unknown view of a person,
in which pose was automatically estimated as described in Section 5. The model
was also matched to a frontal view to verify that the reconstructed light fields are
independent of input pose. As before, this person is removed from the model and
the views are not present in the light field AAM. The extreme views of the model
light field fits are overlaid onto a captured light field of the subject. This light
field is also shown as ground truth. Comparing each fit one finds that although
the characteristics of the matched views are favored, the reconstructed light fields
are strikingly similar. Also, note the view-dependent texturing effects present in
the subject's glasses, captured by the model. Comparing the matches of the above
figures, one finds that our algorithm performs well in matching novel light fields
and 2D images to the head manifold. Namely, the skin color, facial hair, and
overall shape and expression of each novel subject are well approximated.



8 Conclusion and Future Work 

We introduced a novel active appearance modeling method based on an image- 
based rendering technique. Light field active appearance models overcome many 
of the limitations presented by current 2D and 3D appearance models. They 
easily model complex scenes, non-lambertian surfaces, and view variation. We 
demonstrated the construction of a light field manifold of the human head using 
50 subjects and showed how to match the model to a light field or single 2D 
image of a person outside of the model. In future work we hope to construct 
a camera array with a wider field of view that utilizes a non-planar camera 
configuration. We expect our approach to scale directly to the construction of 
dynamic light-field appearance manifolds, since our capture apparatus works in 
real-time.

Abstract. In this paper we develop a systematic theory about local
structure of moving images in terms of Galilean differential invariants.
We argue that Galilean invariants are useful for studying moving images
as they disregard constant motion that typically depends on the
motion of the observer or the observed object, and only describe relative
motion that might capture surface shape and motion boundaries. The
set of Galilean invariants for moving images also contains the Euclidean
invariants for (still) images.

Complete sets of Galilean invariants are derived for two main cases: when
the spatio-temporal gradient cuts the image plane and when it is tangent
to the image plane. The former case corresponds to isophote curve motion
and the latter to creation and disappearance of image structure, a case
that is not well captured by the theory of optical flow.

The derived invariants are shown to be describable in terms of acceleration,
divergence, rotation and deformation of image structure.

The described theory is completely based on bottom up computation
from local spatio-temporal image information.



1 Introduction 

The aim of this paper is to describe the local (differential) structure of moving
images. By doing this we want to find a set of local differential descriptors that
can describe local spatio-temporal patterns, much as e.g. gradient strength,
Laplacian zero-crossings, blob and ridge detectors, isophote curvature, etc. describe the
local structure in images.

The dominating approach to computational visual motion processing (reviewed
in [2,15]) is to first compute the optical flow field, i.e. the velocity vectors
of the particles in the visual observer’s field of view, projected on its visual
sensor area. From this, various properties of the surrounding scene can be
computed. Ego-motion can, under certain circumstances, be computed from the
global shape of the field, object boundaries from discontinuities in the field, and
surface shape and motion for rigid objects from the local
differential structure of the field [12,13].

Unfortunately the computation of the optical flow field leads to a number of
well known difficulties. The input is the projected (gray-level) image of the
surroundings as a function of time, i.e. a three-dimensional structure. It is in general
not possible to uniquely identify what path through the spatio-temporal image
is a projection of a certain object point. Thus, further assumptions are needed;
the most common one is the brightness constancy assumption, that the
projection of each object point has a constant gray level. The brightness constancy
assumption breaks down if the light changes, if the object has non-Lambertian
reflection, or if it has specular reflections. Even with this assumption, the problem
is still generically underdetermined. Except at local extrema in the gray-level image, points
with a certain gray-level lie along curves, and these curves sweep out surfaces
in the spatio-temporal image. A point along such a curve can therefore
correspond to any point on the surface at later instants of time. This is referred to as
the aperture problem and is usually treated by invoking additional constraints,
e.g. regularization assumptions, such as smoothly varying brightness patterns,
or parameterized surface models and trajectory models, leading to least-squares
methods applied in small image regions. Besides the questionable validity of these
assumptions, they lead to inferior results near motion boundaries, i.e. the regions
that carry most information about object boundaries. The behavior when new
image structure appears or old structure disappears is also undefined.

An alternative approach for visual motion analysis is to directly analyze the
geometrical structure of the spatio-temporal input image, thereby avoiding the
detour through the optic flow estimation step [18,19,11]. By using the differential
geometry of the spatio-temporal image, we get a low level syntactical description
of the moving image without having to rely on the more high level semantic
concept of object particle motion.

A systematic study of the local image structure, in the context of scale-space
theory, has been pursued by Florack [6]. The basic idea is to find all descriptors
of differential image structure that are invariant to rotation and translation (the
Euclidean group). The choice of Euclidean invariance reflects that the image
structures should be possible to recognize in spite of (small) camera translations
and rotations around the optical axis. This theory embeds many of the operators
previously used in computer vision, such as Canny’s edge detector, Laplacian
zero-crossings, blobs and isophote curvature, as well as enabling the discovery of
new ones.

2 Spatio-Temporal Image Geometry 

Extending from a theory about spatial images to one about spatio-temporal
images it is natural to use the concept of absolute time (see e.g. [8] for a more
elaborate discussion). Each point in space-time can be assigned a numeric label
describing at what time it occurred. The sets of space-time points that occurred at
the same time are called planes of simultaneity, and the temporal distance
between two planes of simultaneity can be measured. (In the small spatio-temporal
regions that seeing creatures operate in, we see no need for handling relativistic
effects; there are however other opinions, see [10].) The space-time can be stratified
into a sequence of planes of simultaneity, and be given coordinate systems that
separate time and space, $(t, x) \in \mathbb{R} \times \mathbb{R}^2$. From the consequences of absolute
time, we conclude that we only want to allow for space-time transformations
that never mix the planes of simultaneity.

As a spatio-temporal image restricted to a plane of simultaneity can be
considered as a still image, the reasons for using Euclidean invariance in the image
plane apply to moving images as well. Image properties should not be
dependent on when we choose to measure them (invariance under time translations).
The local average velocity contains only information about the ego motion and
no information about the three dimensional structure of the environment, and is
therefore natural to disregard. We thus search for properties that are invariant
to the 2+1 dimensional Galilean group. The use of Galilean image geometry
has been proposed in e.g. [4,1,9]. Using parallel projection as image formation
model, the Galilean invariants are those properties of the surroundings that
cannot be explained in terms of a relative constant translational motion. A Galilean
model of the moving image is also implicitly assumed when divergence, curl and
deformation are described as flow field invariants [12].

Definition 1 (Galilean group). The group of Galilean motions

$$\begin{pmatrix} t' \\ x' \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ v & R \end{pmatrix} \begin{pmatrix} t \\ x \end{pmatrix} + a, \tag{1}$$

where $x, v \in \mathbb{R}^n$, $t \in \mathbb{R}$, $R \in SO(n)$ and $a \in \Gamma_{n+1}$.

Each Galilean motion can be decomposed into a spatial rotation, a spatio-temporal
shear (constant velocity) and a space-time translation. It can be shown that
planes of simultaneity (constant time) are invariant and have Euclidean geometry,
i.e. distances and angles are invariants. The temporal distance between planes
of simultaneity is invariant.
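
The invariance statements above can be checked numerically. The following is a minimal sketch for the 2+1 dimensional case, assuming the usual action of a Galilean motion (time is translated; space is rotated, sheared by the velocity times $t$, and translated). The function names `galilean_motion`, `temporal_distance` and `spatial_distance` are illustrative and not taken from the paper.

```python
import math


def galilean_motion(point, velocity, angle, shift):
    """Apply a 2+1 dimensional Galilean motion to a space-time point (t, x, y).

    velocity = (vx, vy) is the constant-velocity shear, angle is the spatial
    rotation angle, and shift = (at, ax, ay) is the space-time translation.
    """
    t, x, y = point
    vx, vy = velocity
    at, ax, ay = shift
    c, s = math.cos(angle), math.sin(angle)
    # Time is only translated, so planes of simultaneity are never mixed.
    t_new = t + at
    # Space is rotated, sheared by velocity * t, and translated.
    x_new = c * x - s * y + vx * t + ax
    y_new = s * x + c * y + vy * t + ay
    return (t_new, x_new, y_new)


def temporal_distance(p, q):
    """Signed temporal distance between the planes of simultaneity of p and q."""
    return q[0] - p[0]


def spatial_distance(p, q):
    """Euclidean distance; only meaningful if p and q are simultaneous."""
    return math.hypot(q[1] - p[1], q[2] - p[2])
```

Two simultaneous points keep their spatial distance under any such motion, and two arbitrary points keep their temporal distance; the spatial distance between non-simultaneous points is in general not preserved, which is why the geometry has no ordinary metric.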

3 Moving Frames 

The Galilean geometry has no metric in the traditional sense. That means that
metric based differential geometry cannot be used in its normal formulations.
We therefore choose to use a Lie group based approach instead (see [14] for a
different approach to a geometry with a degenerate metric).

According to Klein’s famous Erlangen program, given a space $S$ and a group
of transformations $G$ over $S$, the geometric structure of $(S, G)$ is all structure
that is invariant to transformations in $G$. In the following we will study the
differential geometric properties of scalar functions and sub-manifolds (curves
and surfaces) in $\mathbb{R}^2$ and $\mathbb{R}^3$ subject to Galilean and in some cases Euclidean
transformations.

A convenient way to find geometrical structure is to use Cartan’s theory of
moving frames [3,16]. A frame field is a smooth map from the base space to group
elements, $S \to G$. For a Galilean geometry $\Gamma_{n+1}$ the frame field is a mapping
$\mathbb{R}^{n+1} \to G$. A frame field can be conceptualized by its action on an arbitrary
coordinate system for the tangent space of the base space. For $\Gamma_{n+1}$ we can e.g.
attach a Galilean ON-system at each point.
Definition 2. A $\Gamma_{n+1}$ coordinate system is an affine coordinate system where $n$
vectors lie in the spatial part. A $\Gamma_{n+1}$ ON-system is a $\Gamma_{n+1}$ coordinate system
s.t. the spatial part consists of an $n$-dimensional ON-coordinate system and the
remaining base vector has unit temporal length.

The property of being a $\Gamma_{n+1}$ ON-system is a Galilean invariant. In the sequel
we will use the coordinate system view of frame fields as we find it easier to
visualise.

The main idea of Cartan’s theory of moving frames is to put a frame at
each point that is connected to the local structure of the sub-manifold or the
function in an invariant way. In this way we get a frame field.

For a function $f$ defined on $S$, all expressions over mixed derivatives w.r.t.
the Cartan frame at a certain point are by construction geometrical invariants.
This class of invariants is called differential invariants.

On sub-manifolds, we can find the local geometrical structure from how the 
frame field varies in the local neighborhood. 

Let $i$ be any (global) frame and $e$ a frame connected to the local structure
s.t. $e = Ai$, where the attitude transformation $A \in G$ is a function of position.
The local variation of $e$ can be described in an invariant way in terms of $e$,

$$de = dA\, i = dA\, A^{-1} e = C(A)\, e, \tag{2}$$

where the one-form (see [3]) $C(A)$ is called the connection matrix. In a certain
sense, the connection matrix contains all geometric information there is.

Scalar invariants can be generated by contracting the coefficients in the
connection matrix on the vectors in the Cartan frame, $c_{ij} e_k$. A useful property of
the connection matrix is

$$C(AB) = C(A) + A\, C(B)\, A^{-1}, \tag{3}$$

which is a direct consequence of the definition.
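
For completeness, the composition property of the connection matrix can be written out in a few lines. The sketch below uses only the definition $C(A) = dA\,A^{-1}$ from (2) and the product rule for the differential of a matrix product:

```latex
\begin{align*}
C(AB) &= d(AB)\,(AB)^{-1} \\
      &= (dA\,B + A\,dB)\,B^{-1}A^{-1} \\
      &= dA\,A^{-1} + A\,(dB\,B^{-1})\,A^{-1} \\
      &= C(A) + A\,C(B)\,A^{-1}.
\end{align*}
```

This is what makes it possible to build an attitude transformation in steps (for example a shear followed by a rotation) and obtain the connection matrix of the composite from those of the factors.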

The level-sets $f^{-1}(c)$ of smooth scalar functions $f$ are sub-manifolds. Their
geometric structure, the level-set invariants, is invariant w.r.t. the
group of constant monotonic transformations $g \circ f$, $g : \mathbb{R} \to \mathbb{R}$, $g' > 0$.

4 Image Geometry 

Now we will study Galilean differential geometry of moving images using Cartan
frames. Image spaces can be considered as trivial fiber bundles $S \otimes I$, where
$S$ is the base space and the fiber $I$ is log intensity [14]. Most of the time we will
discuss the image geometry in terms of an arbitrary section of the fiber bundle,
i.e. functions $f : S \to I$. We will start by reviewing differential geometry for
images over $E_2$ to illustrate the method of moving frames, and because $E_2$ is a
sub-geometry of $\Gamma_3$, so that we will need these results later anyway. We continue
by studying differential geometry of $\Gamma_2$ and, which is our main goal, differential
geometry of images over $\Gamma_3$.

For scalar functions over $E_2$ there are two typical situations: the gradient is
non-zero almost everywhere and it is zero along curves.
4.1 Gradient Gauge

We study the geometry of functions $f$ in $E_2$. For points $p$ where $\nabla f \neq 0$ we
attach an ON-frame $\{\partial_u, \partial_v\}$ s.t. $f_u = 0$; $(u, v)$ is a gauge coordinate system,

$$\begin{pmatrix} \partial_u \\ \partial_v \end{pmatrix} = \frac{1}{\|\nabla f\|} \begin{pmatrix} f_y & -f_x \\ f_x & f_y \end{pmatrix} \begin{pmatrix} \partial_x \\ \partial_y \end{pmatrix}, \tag{4}$$

where $\{\partial_x, \partial_y\}$ is a global ON-frame.

All functions over $\partial_u^i \partial_v^j f$, $i + j \geq 1$, become invariants w.r.t. rotations in
space and translation in the intensity fibers. From (2) we get the anti-symmetric
connection matrix:

$$C(A) = \begin{pmatrix} 0 & c_{12} \\ -c_{12} & 0 \end{pmatrix}, \tag{5}$$

where

$$c_{12} = \frac{(f_x f_{xy} - f_y f_{xx})\,dx + (f_x f_{yy} - f_y f_{xy})\,dy}{f_x^2 + f_y^2} = -\frac{f_{uu}}{f_v}\,du - \frac{f_{uv}}{f_v}\,dv, \tag{6}$$

and the expression is simplified by the use of the $\{\partial_u, \partial_v\}$ coordinate system
and the relation $f_u = 0$. By contracting $c_{12}$ on the components in the Cartan
frame we arrive at:

Theorem 1. A complete set of level-curve invariants for scalar functions on $E_2$
is the level curve curvature and the flow line curvature,

$$\kappa = c_{12}\partial_u = -f_{uu}/f_v, \qquad \mu = c_{12}\partial_v = -f_{uv}/f_v. \tag{7}$$

These are invariants w.r.t. rotation in the plane and monotonic transformations
in the intensity fibers.
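
As an illustration of Theorem 1, the two invariants can be evaluated from ordinary Cartesian derivatives by expressing the gauge directions in the global frame. The sketch below assumes the gauge frame of the gradient gauge (tangent direction $u$ proportional to $(f_y, -f_x)$, gradient direction $v$ proportional to $(f_x, f_y)$) and the sign convention $\kappa = -f_{uu}/f_v$, $\mu = -f_{uv}/f_v$; the function name `gauge_curvatures` is hypothetical.

```python
import math


def gauge_curvatures(fx, fy, fxx, fxy, fyy):
    """Return (kappa, mu) = (-f_uu / f_v, -f_uv / f_v) from Cartesian derivatives.

    The gauge frame is u = (fy, -fx) / |grad f| (isophote tangent) and
    v = (fx, fy) / |grad f| (gradient direction). Undefined where grad f = 0.
    """
    g = math.hypot(fx, fy)
    if g == 0.0:
        raise ValueError("gradient gauge is undefined at critical points")
    ux, uy = fy / g, -fx / g
    vx, vy = fx / g, fy / g
    # Second directional derivatives a^T H b, with H the Hessian of f.
    f_uu = ux * (fxx * ux + fxy * uy) + uy * (fxy * ux + fyy * uy)
    f_uv = ux * (fxx * vx + fxy * vy) + uy * (fxy * vx + fyy * vy)
    f_v = g  # derivative along the gradient direction
    return -f_uu / f_v, -f_uv / f_v
```

For $f = x^2 + y^2$ at the point $(3, 4)$ the isophote is a circle of radius 5 and the flow lines are straight rays, so the sketch returns a level curve curvature of magnitude $1/5$ and a vanishing flow line curvature.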



4.2 Hessian Gauge 

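The ON-frame (4) is not defined at critical points, $\nabla f = 0$. At typical critical
points we can instead use an ON-frame $\{\partial_p, \partial_q\}$ that diagonalizes the Hessian,
i.e. $f_{pq} = 0$ and $|f_{pp}| > |f_{qq}|$,

$$\begin{pmatrix} \partial_p \\ \partial_q \end{pmatrix} = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix} \begin{pmatrix} \partial_x \\ \partial_y \end{pmatrix}, \tag{8}$$

where $\tan 2\phi = f_{xy}/(f_{yy} - f_{xx})$. All functions over $\partial_p^i \partial_q^j f$, $i + j \geq 2$, become
invariants w.r.t. the unimodular isotropic group, i.e. rotation in the image plane
and addition of a linear light gradient [14]. The Hessian frame $\{\partial_p, \partial_q\}$ is invariant
w.r.t. the isotropic group, i.e. all the motions in the unimodular isotropic group as well as
scaling in the plane and in the intensity fiber [14].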
5 Functions in $\Gamma_2$

First let us study the general geometrical situation for $\Gamma_2$. The attitude
transformation must be of the form:

$$\begin{pmatrix} \partial_s \\ \partial_u \end{pmatrix} = \begin{pmatrix} 1 & v \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \partial_t \\ \partial_x \end{pmatrix} = Ai, \tag{9}$$

where $v$ is a function of the spatio-temporal position and $\{\partial_s, \partial_u\}$ is the adapted
frame. We immediately see that $\partial_u = \partial_x$. The connection matrix becomes:

$$C(A) = \begin{pmatrix} 0 & c_{01} \\ 0 & 0 \end{pmatrix}, \tag{10}$$

where $c_{01} = v_t\,dt + v_x\,dx$. This could be expressed in the adapted coordinate
system instead, giving $c_{01} = v_s\,ds + v_u\,du$. If the coefficient in the connection
matrix is contracted on the vectors in the adapted frame, we get two scalar
invariants. The first, $\alpha = c_{01}\partial_s = v_s$, describes how the spatio-temporal part of the
frame changes in the direction of itself, i.e. it describes the acceleration of the
structure that the frame is adapted to. The other scalar invariant, $\delta = c_{01}\partial_u = v_u$,
describes how the spatio-temporal part of the adapted frame changes in the
spatial direction, i.e. the divergence of the vector field $\partial_s$, restricted to the spatial
line.

For scalar functions on $\Gamma_2$, there are three typical situations: the level curves
are transverse to the spatial lines almost everywhere, along isolated curves the
level curves are tangent to the spatial lines, and there are also isolated critical
points.

If one uses the constant brightness assumption as binding hypothesis between
image patterns and surface motion then the level curves (or isophotes)
correspond to motion in the transversal case and creation or annihilation of structure
in the non-transversal case.



5.1 Spatially Transversal Level Curves 

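On points where the level curve is transverse to the spatial line, $f_x \neq 0$, we
can define a $\Gamma_2$-frame, $\{\partial_s, \partial_x\}$, s.t. $f_s = 0$. Expressed in an arbitrary $\Gamma_2$-frame,
$\{\partial_t, \partial_x\}$, $\partial_s$ must be of the form:

$$\partial_s = \partial_t + \gamma\,\partial_x, \tag{11}$$

using $f_s = 0$ and solving for $\gamma$, we get $\gamma = -f_t/f_x$. Hence the attitude matrix
becomes,

$$A = \begin{pmatrix} 1 & -f_t/f_x \\ 0 & 1 \end{pmatrix}, \tag{12}$$

and for the connection matrix (10), we get:

$$c_{01} = \frac{f_t f_{tx} - f_x f_{tt}}{f_x^2}\,dt + \frac{f_t f_{xx} - f_x f_{tx}}{f_x^2}\,dx = -\frac{f_{ss}}{f_x}\,ds - \frac{f_{sx}}{f_x}\,dx. \tag{13}$$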
Contracting $c_{01}$ on the vectors of the adapted frame we get our scalar invariants;
the invariants are summarized in the following theorem.

Theorem 2. A complete set of level-curve invariants for spatially transversal
level-curves on $\Gamma_2$ is the level-curve acceleration and the level-curve divergence

$$\alpha = c_{01}\partial_s = -f_{ss}/f_x, \qquad \delta = c_{01}\partial_x = -f_{sx}/f_x. \tag{14}$$
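
To make the meaning of the two invariants concrete, they can be estimated for a synthetic 1+1 dimensional image. The sketch below is only an illustration: it uses central finite differences (the paper itself argues for filter-based derivative measurements), the sign convention $\alpha = -f_{ss}/f_x$, $\delta = -f_{sx}/f_x$, and the hypothetical function name `level_curve_invariants` with an assumed step size `h`.

```python
import math


def level_curve_invariants(f, t, x, h=1e-3):
    """Estimate (alpha, delta) for an image f(t, x) by central differences.

    alpha = -f_ss / f_x is the level-curve acceleration and
    delta = -f_sx / f_x the level-curve divergence, where
    d_s = d_t + gamma * d_x with gamma = -f_t / f_x (so that f_s = 0).
    """
    f_t = (f(t + h, x) - f(t - h, x)) / (2 * h)
    f_x = (f(t, x + h) - f(t, x - h)) / (2 * h)
    f_tt = (f(t + h, x) - 2 * f(t, x) + f(t - h, x)) / h ** 2
    f_xx = (f(t, x + h) - 2 * f(t, x) + f(t, x - h)) / h ** 2
    f_tx = (f(t + h, x + h) - f(t + h, x - h)
            - f(t - h, x + h) + f(t - h, x - h)) / (4 * h ** 2)
    if f_x == 0.0:
        raise ValueError("level curve is tangent to the spatial line (f_x = 0)")
    gamma = -f_t / f_x
    # Second derivatives along the fixed direction d_s at this point.
    f_ss = f_tt + 2 * gamma * f_tx + gamma ** 2 * f_xx
    f_sx = f_tx + gamma * f_xx
    return -f_ss / f_x, -f_sx / f_x
```

For a rigidly translating pattern $f(t, x) = g(x - ct - at^2/2)$ the sketch recovers the acceleration $a$ and a vanishing divergence, while for an expanding pattern $f(t, x) = g(x e^{-kt})$ it recovers the divergence $k$; the constant part $c$ of the velocity does not enter either value, as expected of a Galilean invariant.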



5.2 Hessian Invariants 



On points where $f_x = 0$, there is no tangent gauge. For points where $f_{xx} \neq 0$,
we can define a Hessian gauge, i.e. an adapted Galilean ON-frame $\{\partial_s, \partial_x\}$ s.t.
$f_{sx} = 0$. Repeating the steps from the last section, applying (11) on $f_x$, using
$f_{sx} = 0$ and solving for $\gamma$, we get the attitude transformation:

$$A = \begin{pmatrix} 1 & -f_{tx}/f_{xx} \\ 0 & 1 \end{pmatrix}, \tag{15}$$

and in the connection matrix (10), we get:

$$c_{01} = -\frac{f_{ssx}}{f_{xx}}\,ds - \frac{f_{sxx}}{f_{xx}}\,dx = \alpha\,ds + \delta\,dx, \tag{16}$$

which we summarize in the following theorem.

Theorem 3. A complete set of Hessian invariants for points where $f_{xx} \neq 0$ on
$\Gamma_2$ is Hessian acceleration and Hessian divergence

$$\alpha = c_{01}\partial_s = -f_{ssx}/f_{xx}, \qquad \delta = c_{01}\partial_x = -f_{sxx}/f_{xx}. \tag{17}$$



6 Functions in $\Gamma_3$

For Galilean 2+1 dimensional geometry, the attitude matrix in general has the
form:

$$\begin{pmatrix} \partial_s \\ \partial_u \\ \partial_v \end{pmatrix} = \begin{pmatrix} 1 & v^x & v^y \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} \partial_t \\ \partial_x \\ \partial_y \end{pmatrix} = Ai, \tag{18}$$

where $v^x$, $v^y$ and $\theta$ are functions of the spatio-temporal position. It can be shown
that the connection matrix expressed in the adapted coordinate system has the
form:

$$C(A) = \begin{pmatrix} 0 & \alpha^u ds + \delta^u du + \sigma^u dv & \alpha^v ds + \sigma^v du + \delta^v dv \\ 0 & 0 & \rho\,ds + \kappa^u du + \kappa^v dv \\ 0 & -(\rho\,ds + \kappa^u du + \kappa^v dv) & 0 \end{pmatrix} = \begin{pmatrix} 0 & c_{01} & c_{02} \\ 0 & 0 & c_{12} \\ 0 & -c_{12} & 0 \end{pmatrix}. \tag{19}$$

Here $c_{01}$ and $c_{02}$ describe how the spatio-temporal part of the frame moves in
different directions: $c_{01}$ describes the motion projected on the $\{\partial_s, \partial_u\}$ plane
and $c_{02}$ the motion projected on the $\{\partial_s, \partial_v\}$ plane. The form $c_{12}$ describes how
the spatial frame $\{\partial_u, \partial_v\}$ rotates when moving in different directions.

Contracting the connection forms on the different vectors in the local adapted
frame, we get nine different scalar invariants. We continue by giving these
invariants an interpretation. If we consider the integral curves from the vector field
$\{\partial_s\}$, then $\alpha^u$ describes the acceleration of the integral curve projected on the
$\{\partial_s, \partial_u\}$ plane, and $\alpha^v$ the corresponding acceleration on the $\{\partial_s, \partial_v\}$ plane; $\rho$
describes how much the spatial part of the frame rotates in the $\partial_s$ direction.
The invariants $\kappa^u$ and $\kappa^v$ describe the curvatures of the integral curves for the
vector fields $\{\partial_u\}$ and $\{\partial_v\}$ respectively. The remaining invariants describe how
the vector field $\{\partial_s\}$ changes for motions in the spatial plane: $\delta^u$ and $\delta^v$ describe
the divergence in the $\partial_u$ and $\partial_v$ directions respectively, $\sigma^u$ describes the skew of
the vector field in the $\partial_u$ direction while moving in the $\partial_v$ direction, and $\sigma^v$ the
skew in the $\partial_v$ direction while moving in the $\partial_u$ direction.



6.1 More Descriptive Invariants 



Even if the above discussed set of scalar invariants constitutes a complete set
of scalar invariants for $\Gamma_3$, they are not necessarily the ones that have the largest
descriptive value. As any invertible transformation of the scalar invariants gives
rise to a new complete set of scalar invariants, we will develop a set of invariants
that are closer to what has been used in other work about moving images.

The acceleration invariants $\{\alpha^u, \alpha^v\}$ could instead be described in a polar
coordinate system:

$$\alpha = \sqrt{(\alpha^u)^2 + (\alpha^v)^2}, \qquad \alpha_\theta = \arctan(\alpha^v/\alpha^u), \tag{20}$$

here $\alpha$ is the magnitude of the acceleration and $\alpha_\theta$ the angle relative to the $\partial_u$
direction. The invariants $\delta^u, \delta^v, \sigma^u, \sigma^v$ describe how $\partial_s$ changes along motions
in the spatial plane. Observe that the vectors in the vector field $\{\partial_s\}$ always
have unit length in the temporal direction, therefore the vector field restricted
to a certain spatial plane can be projected onto that plane without losing any
essential information. The matrix:

$$D = \begin{pmatrix} \delta^u & \sigma^u \\ \sigma^v & \delta^v \end{pmatrix} \tag{21}$$



is the rate of strain tensor for that projected vector field and it might be more 
useful to describe the invariants in terms of the Cauchy-Stokes decomposition 
theorem [12]: 



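$$D = \frac{\sigma^u - \sigma^v}{2} \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} + \frac{1}{2} \begin{pmatrix} 2\delta^u & \sigma^u + \sigma^v \\ \sigma^u + \sigma^v & 2\delta^v \end{pmatrix} \tag{22}$$

$$\phantom{D} = \frac{\operatorname{curl} D}{2} \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} + \frac{\operatorname{div} D}{2} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} + \frac{\operatorname{def} D}{2} \begin{pmatrix} \cos 2\phi & \sin 2\phi \\ \sin 2\phi & -\cos 2\phi \end{pmatrix}. \tag{23}$$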



First the matrix can be decomposed into an antisymmetric and a symmetric part,
where the coefficient of the antisymmetric part is called the curl and describes
the rotational component of the vector field. The symmetric part can in turn
be decomposed into a multiple of the identity matrix, the divergence part that
describes the dilation component of the vector field, and a symmetric matrix with
zero trace. The remaining symmetric component of the matrix can be described
in terms of the deformation, i.e. an area preserving stretching in one direction
combined with shrinking in the orthogonal direction, and the direction $\phi$ of the
stretching relative to the direction of $\partial_u$.
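
The decomposition described above is easy to carry out numerically for a given $2 \times 2$ matrix. The following is a minimal sketch, assuming that the antisymmetric part is written as a multiple of $\begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$ (which fixes the sign of the curl) and that the deformation is taken as non-negative; the function names `cauchy_stokes` and `recompose` are illustrative.

```python
import math


def cauchy_stokes(d_uu, d_uv, d_vu, d_vv):
    """Split D = [[d_uu, d_uv], [d_vu, d_vv]] into (curl, div, deform, phi).

    Sign convention (an assumption of this sketch): the antisymmetric part is
    (curl / 2) * [[0, 1], [-1, 0]], so curl = d_uv - d_vu.
    """
    curl = d_uv - d_vu
    div = d_uu + d_vv
    # Traceless symmetric part has the form [[a, b], [b, -a]].
    a = 0.5 * (d_uu - d_vv)
    b = 0.5 * (d_uv + d_vu)
    deform = 2.0 * math.hypot(a, b)
    # Direction of stretching relative to the first (u) axis.
    phi = 0.5 * math.atan2(b, a)
    return curl, div, deform, phi


def recompose(curl, div, deform, phi):
    """Rebuild (d_uu, d_uv, d_vu, d_vv) from the four components."""
    c2, s2 = math.cos(2 * phi), math.sin(2 * phi)
    return (0.5 * div + 0.5 * deform * c2,
            0.5 * curl + 0.5 * deform * s2,
            -0.5 * curl + 0.5 * deform * s2,
            0.5 * div - 0.5 * deform * c2)
```

Since the map between the four matrix entries and (curl, div, def, $\phi$) is invertible wherever the deformation is non-zero, the four new quantities carry the same information as the original divergence and skew invariants.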

6.2 Choice of Gauge 

For Galilean 2+1 dimensional geometry isophotes are typically 2 dimensional 
surfaces. There are two generic cases: points where the isophote surface cuts 
the spatial surface through the point, and points where the isophote surface is 
tangent to the spatial surface through the point. The first case can be interpreted 
as motion of isophote curves in the image, and the second case as creation, 
annihilation or saddle points. 



6.3 Tangent Gauge 

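Our next task is to define an adapted frame for points where the isophote surface
cuts the spatial surface. For the spatial plane we can reuse the tangent gauge for
$E_2$ in Section 4.1. Starting from an arbitrary frame $i$, we first adapt the spatial
sub-frame $\{\partial_x, \partial_y\}$ to the gradient and tangent direction in the spatial plane:

$$\begin{pmatrix} \partial_t \\ \partial_u \\ \partial_v \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & f_y/\|\nabla f\| & -f_x/\|\nabla f\| \\ 0 & f_x/\|\nabla f\| & f_y/\|\nabla f\| \end{pmatrix} \begin{pmatrix} \partial_t \\ \partial_x \\ \partial_y \end{pmatrix}. \tag{24}$$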



The spatio-temporal vector $\partial_s$ must have unit length in time to be part of a
Galilean frame. By requiring $\partial_s$ to lie in the spatio-temporal tangent plane, i.e.
$f_s = 0$, it is constrained in one direction. The adapted spatio-temporal direction
must have the form:

$$\partial_s = \partial_t + \beta\,\partial_u + \gamma\,\partial_v,$$

in terms of the new frame. Using $0 = f_s = f_t + \gamma f_v$ and solving for $\gamma$ we get
that $\gamma = -f_t/f_v$. Still we have one undetermined degree of freedom $\beta \in \mathbb{R}$. For
each choice of $\beta$ we have a plane spanned by $\{\partial_s, \partial_v\}$. The image restricted to
such a plane is a function on $\Gamma_2$ and can be studied by the methods from Section
5.1. From Theorem 2 there are two scalar invariants: acceleration $\alpha = -f_{ss}/f_v$
and divergence $\delta = -f_{sv}/f_v$. We can see that the acceleration becomes a quadratic
function of $\beta$ and thus the gauge can be fixed by finding a $\beta$ s.t. $\alpha(\beta)$ is an
extremum, i.e. by solving $\partial_\beta \alpha(\beta) = 0$ for $\beta$, which gives:

$$\beta_\alpha = \frac{f_t f_{uv}}{f_v f_{uu}} - \frac{f_{tu}}{f_{uu}}, \tag{25}$$

which is defined as long as $f_{uu} \neq 0$, i.e. as long as the isophote curvature
in the spatial plane is non-vanishing. It can be shown that the requirement of an
acceleration extremum is equivalent to requiring $f_{su} = 0$, i.e. finding a $\beta$ that
diagonalizes the Hessian matrix in the $\{\partial_s, \partial_u\}$ plane. We will also see that this
choice of gauge makes the direction of the spatial tangent, $\partial_u$, constant along
$\partial_s$, i.e. $\rho = 0$. From this requirement Guichard [9] derived the same gauge as we
use here.
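
The extremum condition for the acceleration can be verified in a few lines. The sketch below uses only the form $\partial_s = \partial_t + \beta\,\partial_u + \gamma\,\partial_v$, the relation $\gamma = -f_t/f_v$, and the fact that $f_v$ does not depend on $\beta$; it also shows why the extremum is equivalent to $f_{su} = 0$:

```latex
\begin{align*}
f_{ss} &= f_{tt} + \beta^2 f_{uu} + \gamma^2 f_{vv}
          + 2\beta f_{tu} + 2\gamma f_{tv} + 2\beta\gamma f_{uv}, \\
\partial_\beta f_{ss} &= 2\,(f_{tu} + \beta f_{uu} + \gamma f_{uv}) = 2 f_{su}, \\
\partial_\beta \alpha = 0
  &\iff f_{su} = 0
  \iff \beta = -\frac{f_{tu} + \gamma f_{uv}}{f_{uu}}
            = \frac{f_t f_{uv}}{f_v f_{uu}} - \frac{f_{tu}}{f_{uu}} .
\end{align*}
```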

Another choice of spatio-temporal gauge can be found by studying the
divergence as a function of $\beta$. The divergence is a linear function of $\beta$, and the
disappearance of the divergence, $\delta(\beta) = 0$, is a natural way to fix the gauge,
giving:

$$\beta_\delta = \frac{f_t f_{vv}}{f_v f_{uv}} - \frac{f_{tv}}{f_{uv}}. \tag{26}$$

This is defined as long as $f_{uv} \neq 0$, i.e. when the flow line curvature in the spatial
plane is non-vanishing. It can be shown that the disappearance of the divergence
is equivalent to requiring that $f_{sv} = 0$, i.e. finding a $\beta$ such that the Hessian in
the $\{\partial_s, \partial_v\}$ plane is diagonalized.

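Using (25) and (24) we find the attitude matrix for the acceleration based
tangent gauge,

$$\begin{pmatrix} \partial_s \\ \partial_u \\ \partial_v \end{pmatrix} = \begin{pmatrix} 1 & \dfrac{f_t f_{uv}}{f_v f_{uu}} - \dfrac{f_{tu}}{f_{uu}} & -\dfrac{f_t}{f_v} \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \partial_t \\ \partial_u \\ \partial_v \end{pmatrix}. \tag{27}$$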



The connection matrix can then be found by a tedious but elementary calculation
using (3). Using the notation from our general discussion about $\Gamma_3$ invariants, the
elements in the connection matrix (19) become:

$$c_{01} = \alpha^u ds + \delta^u du + \sigma^u dv, \qquad c_{02} = \alpha^v ds + \delta^v dv, \qquad c_{12} = \kappa\,du + \mu\,dv. \tag{28}$$

Observe that the skew invariant $\sigma^v$, which describes the skew in the gradient
direction while moving in the tangent direction, disappears. The spatio-temporal
rotation of the frame in the spatial plane, $\rho$, disappears as well. We use the
conventional notation $\kappa = \kappa^u$, $\mu = \kappa^v$ for isophote and flow line curvature. We list
the resulting scalar invariants in the following theorem.

Theorem 4. A complete set of scalar invariants for scalar functions on $\Gamma_3$ at
points where the gradient and isophote curvature are non-vanishing are
acceleration in the tangent and gradient direction,

$$\alpha^u = \frac{f_{ss} f_{uv}}{f_v f_{uu}} - \frac{f_{ssu}}{f_{uu}}, \qquad \alpha^v = -\frac{f_{ss}}{f_v}, \tag{29}$$

divergence in the tangent and gradient direction and skew in the gradient
direction while moving in the tangent direction,

$$\delta^u = -\frac{f_{suu}}{f_{uu}}, \qquad \delta^v = -\frac{f_{sv}}{f_v}, \qquad \sigma^u = \frac{f_{sv} f_{uv}}{f_v f_{uu}} - \frac{f_{suv}}{f_{uu}}, \tag{30}$$

as well as isophote and flow line curvature (see Theorem 1).
The invariant $\alpha^v$ is also found in [9], where it is denoted accel. The reasoning
leading to Theorem 4 can be repeated for the divergence based tangent gauge
(26).

Theorem 5. A complete set of scalar invariants for scalar functions on $\Gamma_3$ at
points where the gradient and flow line curvature are non-vanishing are
acceleration in the tangent and gradient direction,

$$\alpha^u = \frac{f_{ss} f_{vv}}{f_v f_{uv}} - \frac{f_{ssv}}{f_{uv}}, \qquad \alpha^v = -\frac{f_{ss}}{f_v}, \tag{31}$$

divergence in the tangent and gradient direction, skew in the gradient direction
while moving in the tangent direction,

$$\delta^u = \frac{f_{su} f_{vv}}{f_v f_{uv}} - \frac{f_{suv}}{f_{uv}}, \qquad \sigma^v = -\frac{f_{su}}{f_v}, \qquad \sigma^u = -\frac{f_{svv}}{f_{uv}}, \tag{32}$$

and isophote and flow line curvature (see Theorem 1).



6.4 Hessian Gauge 

On points where the isophote surface is tangent to the spatial surface, the tangent
gauge is not defined. As long as the Hessian is non-degenerate, which generically
is the case, we can define an adapted $\Gamma_3$-frame, $\{\partial_r, \partial_p, \partial_q\}$, that diagonalizes the
Hessian, i.e. $f_{pq} = f_{rp} = f_{rq} = 0$. We use the fact that the spatio-temporal vector
in the adapted frame must be of the form

$$\partial_r = \partial_t + \beta\,\partial_x + \gamma\,\partial_y. \tag{33}$$

Starting by diagonalizing the Hessian in the spatio-temporal direction we get
the constraints $f_{rx} = f_{ry} = 0$, and by using (33) and solving for $\beta$ and $\gamma$, we get

$$\beta = \frac{f_{ty} f_{xy} - f_{tx} f_{yy}}{f_{xx} f_{yy} - f_{xy}^2}, \qquad \gamma = \frac{f_{tx} f_{xy} - f_{ty} f_{xx}}{f_{xx} f_{yy} - f_{xy}^2}. \tag{34}$$

This gives the first part of the attitude transformation, a spatio-temporal shear
$A$. If we project $\partial_r$ on the spatial plane we get the same vector field as when
the optical flow constraint equation is used on the gradient of the image [17].
As the next step the frame must be rotated in the spatial plane s.t. the spatial
Hessian is diagonalized. Here we can use the results for the Hessian gauge for
$E_2$ reviewed in Section 4.2. Combining these steps we get

$$\begin{pmatrix} \partial_r \\ \partial_p \\ \partial_q \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi & \cos\phi \end{pmatrix} \begin{pmatrix} 1 & \beta & \gamma \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \partial_t \\ \partial_x \\ \partial_y \end{pmatrix} = BAi, \tag{35}$$

where $\tan 2\phi = f_{xy}/(f_{yy} - f_{xx})$. We proceed using (3) and the same reasoning as
for the tangent based frames in the preceding section and arrive at the following
theorem.
Theorem 6. A complete set of scalar invariants for scalar functions on $\Gamma_3$ at
points where the Hessian is non-degenerate is as follows:

$$\begin{aligned}
\alpha^p &= -f_{rrp}/f_{pp}, & \alpha^q &= -f_{rrq}/f_{qq}, \\
\delta^p &= -f_{rpp}/f_{pp}, & \delta^q &= -f_{rqq}/f_{qq}, \\
\sigma^p &= -f_{rpq}/f_{pp}, & \sigma^q &= -f_{rpq}/f_{qq}, \\
\rho &= f_{rpq}/(2f_{pp} - 2f_{qq}), \\
\kappa^p &= f_{ppq}/(2f_{pp} - 2f_{qq}), & \kappa^q &= f_{pqq}/(2f_{pp} - 2f_{qq}).
\end{aligned} \tag{36}$$

Observe that in contrast to the tangent based gauge systems the Hessian 
gauge has all the scalar invariants listed in (19). 
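
The shear step of the Hessian gauge amounts to solving a $2 \times 2$ linear system, $f_{tx} + \beta f_{xx} + \gamma f_{xy} = 0$ and $f_{ty} + \beta f_{xy} + \gamma f_{yy} = 0$. A minimal sketch is given below; the function name `hessian_gauge_shear` is hypothetical, and the derivative values are assumed to be supplied by some external measurement.

```python
def hessian_gauge_shear(f_tx, f_ty, f_xx, f_xy, f_yy):
    """Solve f_rx = f_ry = 0 for (beta, gamma) in d_r = d_t + beta d_x + gamma d_y.

    Requires a non-degenerate spatial Hessian (non-zero determinant).
    """
    det = f_xx * f_yy - f_xy ** 2
    if det == 0.0:
        raise ValueError("degenerate spatial Hessian: the gauge is undefined")
    beta = (f_ty * f_xy - f_tx * f_yy) / det
    gamma = (f_tx * f_xy - f_ty * f_xx) / det
    return beta, gamma
```

For a rigidly translating pattern $f(t, x, y) = g(x - at, y - bt)$ the mixed derivatives are $f_{tx} = -(a f_{xx} + b f_{xy})$ and $f_{ty} = -(a f_{xy} + b f_{yy})$, and the sketch returns $(\beta, \gamma) = (a, b)$, i.e. the spatial projection of $\partial_r$ is the pattern velocity.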

7 Conclusion and Discussion 

In this paper we have developed a systematic theory about local structure of
moving images in terms of Galilean differential invariants. We have argued that
Galilean invariants are useful for studying moving images as they disregard constant
motion that typically depends on the motion of the observer or the observed
object, and only describe relative motion that might capture surface shape and
motion boundaries. The set of Galilean invariants for moving images also
contains the Euclidean invariants for (still) images.

Compared to using optic flow as the basic element for describing image
motion, the theory suggested above is completely bottom-up and local, while optic
flow is based on trying to directly interpret the image motion in terms of (the
projection of) motion of object surface points. The estimation of optic flow is
non-local as it typically is based on gathering statistics about low level features in
a small spatio-temporal surrounding. There are also Galilean differential
invariants that can capture creation and disappearance of image structure, situations
that are not covered by the concept of optic flow.

Experimental work is of course needed for evaluating how useful the suggested
theory is for finding structure in real image sequences. Spatio-temporal image
derivatives cannot be measured at a point; an integration over a non-vanishing
spatio-temporal volume is needed [7], i.e. we need filters for measuring
derivatives. As there are no localized filters that are invariant w.r.t. Galilean shear [5], a
family of velocity adapted filters is needed. For computing a Galilean differential
invariant, the velocity adapted filter used for measuring it should have the same
spatio-temporal direction as the spatio-temporally directed gauge coordinate for
the invariant. This could either be implemented by searching over a precomputed
set of spatio-temporally directed derivative filters or by iteratively adapting
the spatio-temporal direction of the filter. It should be noted that in general,
gauge adapted derivative filters can be found for several spatio-temporal
directions at a point, i.e. for real image sequences the invariants can be multi-valued.
This can be the case for e.g. transparent motion.
Abstract. Recent techniques for multi-camera tracking have relied on 
either overlap between the fields of view of the cameras or on a visible 
ground plane. We show that if information about the dynamics of the 
target is available, we can estimate the trajectory of the target without 
visible ground planes or overlapping cameras. 



1 Introduction 

We explore the problem of tracking individuals using a network of non-over- 
lapping cameras. Recent techniques for multi-camera tracking have relied on two 
kinds of cues. Some rely on overlap between the fields of view of the cameras to 
calculate the real-world coordinate of the target. Others assume that the ground 
plane is visible and map image points known to lie on the ground plane to the 
real world using homography. In this paper, we show that if information about 
the dynamics of the target is available, we can estimate trajectories without 
visible ground planes or overlapping cameras. 

We are interested in instrumenting as large an environment as possible with 
a small number of cameras. To maximize the coverage area of the network, the 
camera fields of view (FOVs) rarely overlap. Further, in our indoor setting, we 
wish to use cameras that may not have a clear view of the ground plane due to 
occlusions or their horizontal orientation. 

Without a visible ground plane, each camera can only estimate the bearing 
of the ray from the camera optical center to the target. Therefore the target’s 
location can only be determined up to a scale factor with a single camera. How- 
ever, information about a target’s dynamics can be helpful in localizing it. For 
example, if a target is known to be moving at a given constant speed in the 
ground plane, its location can be fully recovered by matching its speed in the 
image plane to its ground-plane speed. See Figure 1(a). 
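
The known-speed case can be illustrated with a deliberately simplified calculation. The sketch below assumes a pinhole camera with focal length given in pixels and a target moving parallel to the image plane, so that image speed equals focal length times ground speed divided by depth; these assumptions, and the function name `depth_from_known_speed`, are not taken from the text and serve only to show how a known speed removes the scale ambiguity.

```python
def depth_from_known_speed(focal_px, ground_speed, image_speed_px):
    """Depth of a target moving parallel to the image plane.

    Under the pinhole model, image_speed = focal * ground_speed / depth,
    so depth = focal * ground_speed / image_speed.
    """
    if image_speed_px == 0:
        raise ValueError("a stationary image point gives no depth information")
    return focal_px * abs(ground_speed) / abs(image_speed_px)
```

With the same true speed, a slowly moving image point yields a large depth and a fast moving one a small depth, which is the qualitative behavior described above.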

If the target’s speed is unknown but constant, its trajectory can be estimated 
with two non-overlapping cameras. See Figure 1(b). The time interval in which 
the target leaves the first field of view and enters the second one is inversely 
proportional to the velocity of the target. Using the second camera helps us 
recover the speed, which in turn allows us to localize the target. 
Fig. 1. (a) A horizontally mounted camera recovers the location of a target up to a scale
factor. If the target’s speed is known, the ambiguity disappears. For example, given its
true speed, if the target is seen to be moving slowly, it must be far away. If it appears
to be moving fast, it must be near. (b) A second camera provides enough information
to estimate the (constant) speed of the target, making it possible to localize it.



The cases where speed is constant and known, or constant and unknown, 
are straightforward to handle. In this paper, we generalize to the case where the 
target moves with varying but smooth velocity. These dynamics can be modeled 
with a Gauss-Markov process. Given the dynamics of the target, we search for 
a trajectory that is most compatible with these dynamics and the observations 
made by the cameras. The resulting trajectories capture the gross features of 
the motion of the target and use the dynamics to interpolate sections of the 
trajectory not observed by the cameras. 

We incorporate the smoothness of the trajectory as a prior in the Bayesian 
framework (Sect. 3). The camera measurements will provide observations that 
define a likelihood on trajectories (Sect. 4). The Maximum a Posteriori (MAP) 
trajectory can be recovered by iteratively solving a quadratic program (Sect. 5). 
We validate our system on both synthetic and real data (Sect. 6 and Sect. 7). 

2 Related Work 

Recently, there has been a significant amount of work in tracking people across 
multiple views. Some of the proposed approaches seek to hand off image-based 
tracking from camera to camera without recovering real-world coordinates [1]. 
We focus on those that recover the real-world coordinate of the person. Multi- 
camera person trackers can be categorized as overlapping systems and non- 
overlapping systems. 

Tracking with overlapping cameras has relied on either narrow baseline stereo 
[4,3] or wide baseline matching [5]. These methods use the correspondence across 
views to determine the location of the target in the real world. 

Most research with non-overlapping cameras has focused on maintaining con- 
sistent identity between multiple targets as they exit one field of view and enter 
another [8,6]. This is known as the data association problem. These techniques 
cannot help determine the real-world position of a target if individual cameras 
cannot make this determination individually. They provide machinery for 
establishing correspondences across disjoint views, but not for localization. 

Caspi and Irani [2] provided another example where correspondence could 
be avoided. They showed how to align a pair of image sequences acquired from 
non-overlapping cameras, when the object being imaged spans the fields of view 
of the two cameras. The coherent motion of a planar object as seen 
by two nearby cameras can compensate for the lack of correspondence. In our 
case, coherent target dynamics provide this kind of coherence. 

3 Trajectory Model 

We assume that each camera can identify each person in its field of view from 
frame to frame. This allows us to track individuals independently of each other. 
For the rest of this paper, we assume that the tracking problem is decoupled in 
this way and we only discuss tracking each individual separately. This system 
could be augmented by more sophisticated person identification schemes than 
the one described in Sect. 7, such as the ones discussed in Sect. 2. 

We use a linear Gaussian state-space model to describe the smoothness of the 
trajectory on the ground plane. This will define a prior $p(X)$ on the trajectory 
to be estimated. 

Define the state $x_t$ of the target at time $t$ as 

\[ x_t = \begin{bmatrix} u_t & \dot{u}_t & v_t & \dot{v}_t \end{bmatrix}^\top, \]

where $u_t$ and $v_t$ are the x and y locations of the target on the ground plane, 
and $\dot{u}_t$ and $\dot{v}_t$ describe the target’s instantaneous velocity. We assume that the 
state evolves according to linear Gaussian Markov dynamics: 



\[ x_{t+1} = A x_t + \nu_t, \tag{1} \]

where $\nu_t$ is a zero-mean Gaussian random variable with covariance $\Sigma_\nu$. For 
example, in the synthetic example of §6 we set 

\[ A = \begin{bmatrix} 1 & 0.5 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0.5 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \]



K = 10-®diag ( [ 10-* 110-*!]), 



so that each $x_{t+1}$ adds the velocities in $x_t$ to the positions in $x_t$, and nudges the 
old velocities by Gaussian noise. The resulting poses are also nudged by a small 
amount of Gaussian noise. 
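
The dynamics of equation (1) can be simulated directly. The sketch below uses the transition matrix $A$ given above; the noise covariance `SIGMA_NU`, the initial state `x0`, and the function name `simulate` are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

# Transition matrix A as given in the text: positions advance by 0.5 * velocity.
A = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])

# ASSUMPTION: placeholder noise covariance, small on positions and larger on
# velocities. These magnitudes are illustrative, not the paper's values.
SIGMA_NU = 1e-3 * np.diag([1e-4, 1.0, 1e-4, 1.0])


def simulate(x0, T, sigma_nu, rng):
    """Sample x_1..x_T from x_{t+1} = A x_t + nu_t with nu_t ~ N(0, sigma_nu)."""
    L = np.linalg.cholesky(sigma_nu)      # sigma_nu = L @ L.T
    X = np.zeros((T, 4))
    X[0] = x0
    for t in range(T - 1):
        X[t + 1] = A @ X[t] + L @ rng.standard_normal(4)
    return X


rng = np.random.default_rng(0)
x0 = np.array([0.0, 0.1, 0.0, 0.2])       # state layout: [u, u_dot, v, v_dot]
traj = simulate(x0, T=50, sigma_nu=SIGMA_NU, rng=rng)
```

With a vanishing noise covariance the sampled path reduces to straight-line motion at the initial velocity, which is the constant-speed case discussed in the introduction.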

The states form a Markov chain over time. $X$, the collection of states from 
time 1 to time $T$, is itself a Gaussian random variable of dimension $1 \times 4T$: 

\[ p(x_t \mid x_{t-1}) = \mathcal{N}(x_t \mid A x_{t-1}, \Sigma_\nu), \]
\[ p(X) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}) = \mathcal{N}(X \mid 0, \Lambda_X), \tag{2} \]

where $\Lambda_X^{-1}$ is block tri-diagonal. We will use the Cholesky decomposition of 
$\Lambda_X^{-1}$ in §5: 

\[ \Lambda_X^{-1} = G^\top G, \qquad G_t = \begin{bmatrix} 0 & \cdots & \Sigma_\nu^{-1/2} A & -\Sigma_\nu^{-1/2} & 0 & \cdots \end{bmatrix}, \tag{3} \]

where $G_t$ is row $t$ of $G$ and $\Sigma_\nu^{-1/2}$ is the Cholesky factor of $\Sigma_\nu^{-1}$. Equation (3) is 
easily derived from the quadratic form inside $p(X)$ defined in equation (2). 
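
As a minimal sketch of equation (3), the factor $G$ can be assembled block by block and checked against the quadratic form it is meant to reproduce. The covariance values, the horizon `T`, and the name `build_G` are assumptions made for illustration; the first state is given a flat prior here.

```python
import numpy as np


def build_G(A, sigma_nu, T):
    """Stack the whitened residuals S (A x_t - x_{t+1}), t = 1..T-1, as G @ X."""
    n = A.shape[0]
    precision = np.linalg.inv(sigma_nu)
    S = np.linalg.cholesky(precision).T    # upper factor: S.T @ S == precision
    G = np.zeros(((T - 1) * n, T * n))
    for t in range(T - 1):
        rows = slice(t * n, (t + 1) * n)
        G[rows, t * n:(t + 1) * n] = S @ A
        G[rows, (t + 1) * n:(t + 2) * n] = -S
    return G


A = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])
sigma_nu = np.diag([1e-2, 1.0, 1e-2, 1.0])   # ASSUMED values, for illustration
T = 5
G = build_G(A, sigma_nu, T)
Lam_inv = G.T @ G                             # inverse covariance of the stacked X
```

Only blocks on the diagonal and the first off-diagonals of `Lam_inv` are non-zero, which is the sparsity the optimization in §5 relies on.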



4 Observation Model 

In general, we wish to consider an object tracked by a set of oblique non- 
overlapping cameras with no visible ground plane. To simplify our task, we 
consider the case where cameras are mounted horizontally, so that only the hor- 
izontal direction in the image plane is relevant in locating a person. In indoor 
settings, horizontal cameras are a way to cover a large area. 

Let $p^i$ be the location of camera $i$ on the ground plane with respect to some 
reference, and let $\theta^i$ be its rotation (yaw). Denote the focal length of the camera 
by $f$. 

Ideally, the width of each person could be used to gauge the distance to 
the target. But our system uses background subtraction, which yields crude 
segmentation, partly because both the moving region and the uncovered region 
are identified as foreground pixels. Therefore we ignore the width of the clusters 
as a depth cue. 

Let $y_t^i$ be the horizontal location of the target as seen in the image plane of 
camera $i$ at time $t$. This measurement is the bearing of the target with respect 
to the camera, and is computed by projecting the target’s location onto the 
camera’s focal plane: 

\[ y_t^i = \pi^i(x_t) + \omega_t, \tag{4} \]
\[ \pi^i(x_t) = f\,\frac{R_1^i (C x_t - p^i)}{R_2^i (C x_t - p^i)}, \qquad C = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}. \tag{5} \]

Here, $R^i$ is the rotation matrix corresponding to $\theta^i$, with rows $R_1^i$ and $R_2^i$, and 
$\omega_t$ is a zero-mean Gaussian random variable with variance $\sigma_\omega^2$. $C$ is a $2 \times 4$ 
matrix that extracts the location of the target from its state. 
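
The noise-free part of the measurement model can be sketched as follows. The rotation convention (depth measured along the optical axis $[\cos\theta, \sin\theta]$, lateral offset along $[-\sin\theta, \cos\theta]$), the unit focal length, and the function name `bearing` are assumptions made for this illustration.

```python
import numpy as np

# C extracts the ground-plane location (u, v) from the state [u, u_dot, v, v_dot].
C = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])


def bearing(x, p, theta, f=1.0):
    """Noise-free image coordinate of state x for a camera at p with yaw theta."""
    axis = np.array([np.cos(theta), np.sin(theta)])      # ASSUMED optical axis
    side = np.array([-np.sin(theta), np.cos(theta)])     # ASSUMED image direction
    d = C @ x - p
    return f * (side @ d) / (axis @ d)


cam_p = np.array([0.0, 0.0])
near = np.array([2.0, 0.0, 1.0, 0.0])
far = np.array([4.0, 0.0, 2.0, 0.0])
y_near = bearing(near, cam_p, 0.0)
y_far = bearing(far, cam_p, 0.0)
```

Both targets lie on the same ray from the camera, so `y_near` and `y_far` coincide. This is the scale ambiguity of Fig. 1(a): a single bearing does not determine depth.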

When the target is within the field of view of a camera, (4) describes a 
likelihood model for each measurement. When the target is out of the field of 
view of a camera, that camera reports $\emptyset$: 

\[ p(y_t^i \mid x_t) = \begin{cases} \mathcal{N}\bigl(y_t^i \mid \pi^i(x_t), \sigma_\omega^2\bigr), & \text{if } I^i(x_t), \\ \delta(y_t^i - \emptyset), & \text{otherwise,} \end{cases} \]

where $I^i(x)$ is an indicator function that determines whether a given point falls 
within the field of view of camera $i$. 
Conditioned on the true location of the target, the measurements are 
independent of each other. Then letting $Y$ be the collection of all measurements from 
all sensors from time $t = 1$ to $t = T$, 

\[ p(Y \mid X) = \prod_{t=1}^{T} \prod_{i=1}^{N} p(y_t^i \mid x_t). \]

It will be useful to decompose the likelihood into observations and constraints: 

\[ p(Y \mid X) = \prod_{(t,i) \in O} \mathcal{N}\bigl(y_t^i \mid \pi^i(x_t), \sigma_\omega^2\bigr) \prod_{(t,i) \in O} I^i(x_t) \prod_{(t,i) \notin O} \bigl(1 - I^i(x_t)\bigr), \]

where $(t,i) \in O \iff y_t^i \neq \emptyset$. The indicator factors are a set of 
constraints that ensure the estimated trajectory only goes through fields of view 
that have actually seen the target. In Sect. 5, we will need to use these 
constraints in a quadratic program. However, quadratic programming requires a 
convex constraint set, and these constraints are not convex. So we relax 
the constraints $I^i$ by defining the function $J^i$ so that $J^i(x_t) = 0$ if $x_t$ is behind 
camera $i$, and 1 if it is in front of it. The new likelihood becomes: 

\[ p(Y \mid X) = \prod_{(t,i) \in O} \mathcal{N}\bigl(y_t^i \mid \pi^i(x_t), \sigma_\omega^2\bigr) \prod_{(t,i) \in O} J^i(x_t). \]

As is shown in the following section, this new constraint set is convex. It 
says that every measurement must have emanated from a sensor that had the 
target in front of it, but it does not penalize the situation where a trajectory 
crosses a field of view without generating an observation. We use this new, more 
permissive likelihood function to find the MAP trajectory. 

5 MAP Trajectory Estimation 

The most probable trajectory given all the camera observations is 

\[ X^* = \arg\max_X\, p(X \mid Y) = \arg\max_X\, p(X)\, p(Y \mid X). \tag{6} \]

A local maximum can be found by iteratively approximating this optimization 
problem with a quadratic program of the form: 

\[ X^* = \arg\max_X\, X^\top Q X + c^\top X \quad \text{s.t. } AX > b. \tag{7} \]

We show how to transform the optimization of equation (6) into a sequence of 
quadratic programs of the form of equation (7). 

Taking the log of $p(X)\,p(Y \mid X)$ and dropping terms that don’t depend on $X$ 
yields a new quantity to minimize: 

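\[ E(X) = X^\top G^\top G X + \frac{1}{\sigma_\omega^2} \sum_{(t,i) \in O} \bigl(\pi^i(x_t) - y_t^i\bigr)^2 - \sum_{(t,i) \in O} \log J^i(x_t). \]

The last term serves as a constraint, so the minimization of $E$ over $X$ can take 
the form of a non-linear least-squares program: 

\[ X^* = \arg\min_X\, e(X) = \arg\min_X\, \tfrac{1}{2}\, r(X)^\top r(X) \quad \text{s.t. } J^i(x_t) = 1 \;\; \forall (t,i) \in O, \]

where 

\[ r(X) = \begin{bmatrix} G X \\ \vdots \\ \frac{1}{\sigma_\omega}\bigl(\pi^i(x_t) - y_t^i\bigr) \\ \vdots \end{bmatrix}, \qquad (t,i) \in O. \]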



Each constraint $J^i(x_t) = 1$ can be recast as a linear constraint for each observed 
point. Figure 2 shows that a point is in front of the sensor if its projection onto 
the camera optical axis is positive. Each field of view constraint becomes an 
inequality constraint: 

\[ J^i(x_t) = 1 \iff n(\theta^i)^\top (C x_t - p^i) > 0, \]

where $n(\theta^i)$ is a vector pointing along the optical axis of camera $i$. Let the rows 
of matrix $A$ and the elements of vector $b$ be: 

\[ A_{(t,i)} = \begin{bmatrix} 0 & \cdots & \cos(\theta^i) & 0 & \sin(\theta^i) & 0 & \cdots \end{bmatrix}, \qquad b_{(t,i)} = n(\theta^i)^\top p^i, \]

with one row per observed trajectory point. The constrained program becomes a 
non-linear least-squares program on a convex domain: 

\[ X^* = \arg\min_X\, e(X) \tag{8} \]
\[ \text{s.t. } AX > b. \tag{9} \]
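
The constraint rows can be assembled as in the sketch below. The function name `fov_constraints`, the zero-based time index, and the two example cameras are assumptions made for illustration. The state layout $[u, \dot{u}, v, \dot{v}]$ follows Sect. 3.

```python
import numpy as np


def fov_constraints(observations, cameras, T):
    """Build (A_c, b) with one row per observation (t, i): n(theta_i) . (C x_t - p_i) > 0.

    observations: list of (t, i) pairs with t in 0..T-1 (zero-based, an assumption).
    cameras: dict mapping camera index i to (p_i, theta_i).
    """
    rows, b = [], []
    for t, i in observations:
        p, theta = cameras[i]
        n = np.array([np.cos(theta), np.sin(theta)])   # optical axis of camera i
        row = np.zeros(4 * T)
        row[4 * t + 0] = n[0]      # coefficient on u_t
        row[4 * t + 2] = n[1]      # coefficient on v_t
        rows.append(row)
        b.append(n @ np.asarray(p, dtype=float))
    return np.array(rows), np.array(b)


cameras = {0: ((0.0, 0.0), 0.0),        # at the origin, looking along +x
           1: ((5.0, 0.0), np.pi)}      # at (5, 0), looking along -x
observations = [(0, 0), (1, 1)]
A_c, b = fov_constraints(observations, cameras, T=2)

# Stacked trajectory: x_0 = (1, 0.5) is in front of camera 0,
# x_1 = (6, 0) lies behind camera 1.
X = np.array([1.0, 0.0, 0.5, 0.0, 6.0, 0.0, 0.0, 0.0])
feasible = (A_c @ X > b).tolist()
```

Each row defines a half-plane, so the feasible set is an intersection of half-planes and therefore convex, as the relaxation requires.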



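To convert equations (8,9) into a quadratic program, we linearize $r(X)$ about 
a guess $X_0$: 

\[ e(X) \approx \tfrac{1}{2}\, \| r(X_0) + J\,(X - X_0) \|^2, \tag{10} \]

where $J$ is the Jacobian of $r(X)$: 

\[ J = \frac{\partial r}{\partial X} = \begin{bmatrix} G \\ \frac{1}{\sigma_\omega} \frac{\partial \pi}{\partial x} \end{bmatrix}, \]

where non-zero terms below $G$ align with the elements of $X$ involved in each error 
term. 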
Fig. 2. The arrow is the camera optical axis $n(\theta)$. The gray region is the field of view 
of the camera. The dot product of $n(\theta)$ and a target location x is positive if the target 
is in front of the camera. 



Substituting (10) into (8), we get a constrained least-squares problem: 

\[ X_1 = \arg\min_X\, \| r(X_0) + J\,(X - X_0) \|^2 \quad \text{s.t. } AX > b. \]

This is a quadratic program in X. Notice that J is very sparse, with exactly 
2 non-zero elements in each row. This expedites finding the optimum with QP 
solvers such as LOQO [9]. 

Iteratively linearizing r and solving this QP is similar to optimizing e using 
Newton-Raphson with inequality constraints. 
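
The linearize-and-solve loop can be illustrated on a toy problem. The sketch below is an unconstrained Gauss-Newton iteration: it omits the inequality constraints and the dynamics prior, uses a finite-difference Jacobian, and localizes a single static point from two bearings. The camera layout, the unit focal length, and all function names are assumptions. A full implementation would pass each linearized problem to a QP solver together with $AX > b$.

```python
import numpy as np


def bearing(q, p, theta):
    """Bearing of ground-plane point q from a camera at p with yaw theta (f = 1)."""
    d = q - p
    depth = np.cos(theta) * d[0] + np.sin(theta) * d[1]
    lateral = -np.sin(theta) * d[0] + np.cos(theta) * d[1]
    return lateral / depth


def gauss_newton(r, X0, iters=20, eps=1e-6):
    """Repeatedly minimise ||r(X0) + J (X - X0)||^2 over X (no constraints)."""
    X = np.asarray(X0, dtype=float)
    for _ in range(iters):
        r0 = r(X)
        # Forward-difference Jacobian, one column per coordinate of X.
        J = np.column_stack([(r(X + eps * e) - r0) / eps for e in np.eye(X.size)])
        step, *_ = np.linalg.lstsq(J, -r0, rcond=None)
        X = X + step
    return X


# ASSUMED toy layout: two cameras with known pose observing one static point.
cams = [(np.array([0.0, 0.0]), 0.0), (np.array([3.0, -3.0]), np.pi / 2)]
truth = np.array([3.0, 1.0])
y = np.array([bearing(truth, p, th) for p, th in cams])


def residual(q):
    """Bearing prediction error for candidate location q."""
    return np.array([bearing(q, p, th) for p, th in cams]) - y


estimate = gauss_newton(residual, X0=[1.0, 0.5])
```

In the trajectory problem the residual would also contain the rows $GX$, and each least-squares step would be replaced by the constrained quadratic program above.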



6 Synthetic Results 

We simulated our approach with synthetic trajectories and sensor measurements. 
Figure 3 depicts the synthetic setup. Sensors are placed around a square envi- 
ronment, and the target’s motion is generated randomly. Whenever the target 
hits the wall, it is reflected back. This trajectory is smoothed and passed to 
synthetic cameras which generate measurements. The state-space model of (1) 
cannot capture these operations, but we show here that state-space dynamics 
are sufficient to generate paths that capture the qualitative motion of the target. 

The optimization must begin with an initial guess that satisfies the 
constraints of equation (9). This is because cutting plane QP solvers such as LOQO 
require an initial point within the convex constraint set. We set the initial iterate 
to have all unobserved trajectory points at the origin, and all observed trajectory 
points at one meter along the optical axis of the camera that observed them. 

Fig. 3. Optimization begins with all points seen by a camera placed at the same 
location. Shown are iterations 0, 1, 3, 9, 11, and 17. 

Figure 3 shows the estimated trajectory as it is refined. In early iterations, 
the likelihood term lines up the trajectory points along the ray from the camera 
optical center to the true target location. But initially, their distances from the 
optical centers are mis-estimated. The prior pulls the trajectory towards the right 
distance in subsequent iterations. Despite the mismatch between the dynamic 
models used in synthesis and estimation, the estimated trajectory is close to 
the true trajectory. Figure 4 shows the final answer of several more synthesized 
problems. 

The field of view constraints are critical in recovering the trajectories. 
Without them, the dynamics pull the trajectory toward infeasible solutions. Figure 5 shows 
a sample trajectory estimated without the constraints. 

7 Results: Real Data 

We have implemented this system on a sensor network of wireless cameras with 
person trackers on board. Each node in our network is a personal digital 
assistant (PDA) computer equipped with a low-resolution camera. These devices have 
wireless network adaptors that allow them to communicate with a MATLAB 
process running on a base station. The real-time person tracker runs on each 
PDA and reports the time-stamped horizontal location of a person to the base 
station every 250 milliseconds. 

Fig. 4. More synthetic results. Notice that the ends of the trajectories do not match 
up with the ground truth. For these points, velocity information can only come from 
past points. Points in the middle of the trajectories, on the other hand, benefit from 
information from the past as well as the future. 

Fig. 5. Without the field of view constraints, the dynamics can pull the trajectories 
behind the cameras. 

We use background subtraction to locate a target within each image plane. 
Foreground pixels are clustered according to their image coordinates using EM 
on a mixture of Gaussians. Each cluster corresponds to one person in the field 
of view of the camera. The clusters have a prior covariance that matches the 
aspect ratio of a human. This coalesces nearby small blobs into human-sized blobs 
and filters out isolated small blobs [7]. For each cluster, an appearance feature 
vector is computed and used to identify the person it represents. The identity 
of the person along with the horizontal component of the center of the cluster is 
transmitted to a central processing station. 

Before experimenting with non-overlapping cameras, we built a model of 
human motion using a pair of overlapping cameras. Stereopsis between the two 
cameras allowed us to recover real-world trajectories without knowledge of 
dynamics. A system identification procedure was applied to these trajectories to 
recover parameters $A$ and $\Sigma_\nu$ of the dynamic model. We estimated $\sigma_\omega^2$ by 
averaging the measurement error at various known locations in the room, at various 
target speeds. 

Fig. 6. IPAQ handheld computers equipped with a camera and a wireless adaptor serve 
as our sensor nodes. They are mounted on the walls in our office building. 

A network of 4 PDAs observed a section of our floor. The cameras were 
mounted perpendicular to walls, so their orientations $\theta^i$ were easy to determine. 
We used the floor plan of our building to determine the location $p^i$ of each 
camera. None of the fields of view overlapped. One test subject walked in the 
environment for about one minute, beginning and ending at the same place. 
Figure 7 sketches the actual trajectory and plots the recovered trajectory. The 
small dots on the trajectory correspond to each discrete time step in the system 
(1/4 of a second apart). 

Notice that the loop seen by camera 1 was successfully recovered. The 
forward and backward legs in this loop are correctly found to be at different depths. 
Without dynamics, it would have been impossible to determine that one path 
is at a different distance from the other. The long legs towards and away from 
camera 3 are not correctly recovered. It was impossible for the system to deter- 
mine the motion of the target in this region because the subject moved along 
the same bearing on those legs. Notice also that the estimated trajectory goes 
through a wall between cameras 3 and 4. Had we encoded these walls in the form 
of additional constraints, the legs through camera 3 might have been correctly 
estimated. 
Fig. 7. Estimated real path. 



8 Conclusion 

We have shown that some side information about the dynamics of a moving 
object can compensate for the lack of simultaneous correspondence between 
two cameras. Our method finds the trajectory that is most compatible with 
both the observations from the cameras and the expected dynamics. By using a 
convex visibility constraint, finding this trajectory can be expressed as a series 
of quadratic programs. 

The method presented in this paper also obviates the need for a visible ground 
plane. This paper has focused on horizontally mounted cameras, but we plan to 
allow input from obliquely mounted cameras in the future. 

As Sect. 7 showed, known obstacles in the environment can provide helpful 
information in constraining the solution. We are working towards adding such 
constraints. Finally, the method we propose is a batch method. Batch processing is useful 
for this problem because the uncertainty in the trajectory between observations 
can be very large. One could run this procedure in small time windows to obtain 
a time-lagged version. 
Abstract. Dimensionality reduction is an essential aspect of visual 
processing. Traditionally, linear dimensionality reduction techniques such as 
principal components analysis have been used to find low dimensional 
linear subspaces in visual data. However, sub-manifolds in natural data 
are rarely linear, and consequently many recent techniques have been 
developed for discovering non-linear manifolds. Prominent among these are 
Local Linear Embedding and Isomap. Unfortunately, such techniques 
currently use a naive appearance model that judges image similarity based 
solely on Euclidean distance. In visual data, Euclidean distances rarely 
correspond to a meaningful perceptual difference between nearby images. 
In this paper, we attempt to improve the quality of manifold inference 
techniques for visual data by modeling local neighborhoods in terms of 
natural transformations between images — for example, by allowing im- 
age operations that extend simple differences and linear combinations. 
We introduce the idea of modeling local tangent spaces of the manifold 
in terms of these richer transformations. Given a local tangent space 
representation, we then embed data in a lower dimensional coordinate 
system while preserving reconstruction weights. This leads to improved 
manifold discovery in natural image sets. 



1 Introduction 

Recently there has been renewed interest in manifold recovery techniques moti- 
vated by the development of efficient algorithms for finding non-linear manifolds 
in high dimensional data. Isomap [1] and Local Linear Embedding (LLE) [2] are 
two approaches that have been particularly influential. Historically, two main 
ideas for discovering low dimensional manifolds in high dimensional data have 
been to find a mapping from the original space to a lower dimensional space 
that: (1) preserves pairwise distances (i.e. multidimensional scaling [3]); or (2) 
preserves mutual linear reconstruction ability (i.e. principal components analysis 
[4]). In each case, globally optimal solutions are linear manifolds. Interestingly, 
the more recent methods for manifold discovery, Isomap and LLE, are based on 
exactly these same two principles, with the generalization that the new 
methods only seek manifold descriptions that locally preserve distances and linear 
reconstructions. In this way, they avoid recovering linear global solutions [1,2]. 

There have been many new variants of these ideas [5, 6, 7, 8]. Although these 
techniques all produce non-linear manifolds in different ways, they are gener- 
ally based on the core assumption that, in natural data, (1) Euclidean distances 
locally preserve geodesic distances on the manifold [1], or (2) data objects can 
be linearly reconstructed from other data points nearby in Euclidean distance 
[2]. However, these core notions are not universally applicable nor always effec- 
tive. Particularly in image data it is easy to appreciate the shortcoming of these 
ideas: For images, weighted linear combinations amount to an awkward trans- 
formation whereby source images have their brightness levels adjusted and then 
are summed directly on top of one another. This is often an unnatural way to 
capture the image transformations that manifolds are intended to characterize. 
Figure 1 shows that centered, cropped and normalized target images can be rea- 
sonably well reconstructed from likewise aligned source images, but that even a 
minor shift, rotation or rescaling will quickly limit the ability of this approach 
to reconstruct a target image. Similarly, measuring Euclidean distances between 
images can sometimes be a dubious practice, since these distances do not always 
correspond to meaningful perceptual differences. 













Fig. 1. Least squares reconstructions of a target image (far right) from three nearby 
images (far left). The intermediate (fourth) image shows the best linear reconstruction 
of the rightmost image from the three leftmost images. First row: original reconstruc- 
tion. Second row: reconstruction of same image after translations have been applied. 



We propose to model manifolds locally by characterizing the local trans- 
formations that preserve the invariants they encode. That is, we attempt to 
characterize those transformations that cause points on the manifold to stay on 
the manifold. Our approach will be to first characterize the local tangent space 
around a data object by considering transformations of that object that cause 
it to stay on (or near) the manifold. 

Other work on incorporating natural image transformations to better model 
visual data has been proposed by [9,10,11,12]. However, this previous work pri- 
marily concerns learning mixture models over images rather than sub-manifolds, 
and most significantly, requires that the image transformations be manually 
specified ahead of time, rather than inferred from the data itself. In this paper, we 
infer local transformations directly from the image data. 

Eigentracking [13] also considers affine transformations of a set of 
preconstructed basis images for an object based on preliminary views. Here we consider 
a potentially richer set of transformations and simultaneously learn the basis in 
addition to the transformations and embedding. 

2 Local Image Transformations 

For images, it is easy to propose simple local transformations that capture natu- 
ral invariants in image data better than simply averaging nearby images together. 
Consider a very simple class of transformations based on receptive fields of pixel 
neighborhoods: Given an $n_1 \times n_2$ image $x$, imagine transforming it into a nearby 
image $\tilde{x} = T(x, \theta)$, where for each pixel $\tilde{x}_i \in \tilde{x}$ we determine its value from 
corresponding nearby pixels in $x$. Specifically, we determine $\tilde{x}_i$ according to 

\[ \tilde{x}_i = \theta^\top x_{N(i)}, \tag{1} \]

where $N(i)$ denotes the set of neighboring pixels of pixel $x_i$. Thus $T(\cdot, \theta)$ defines 
a simple local filter passed over the image, parameterized by a single weight 
vector $\theta$, as shown in Figure 2. 
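
A minimal sketch of such a filter follows. It assumes a square $k \times k$ neighborhood that includes the pixel itself and zero padding at the image border; the names `patches` and `transform` are illustrative.

```python
import numpy as np


def patches(x, k=3):
    """Return a matrix whose row i holds the k-by-k neighbourhood N(i) of pixel i.

    ASSUMPTION: the neighbourhood is centred on the pixel and zero padded at the border.
    """
    h = k // 2
    xp = np.pad(x, h)
    n1, n2 = x.shape
    cols = [xp[a:a + n1, b:b + n2].ravel() for a in range(k) for b in range(k)]
    return np.stack(cols, axis=1)              # shape (n1 * n2, k * k)


def transform(x, theta, k=3):
    """T(x, theta): every output pixel is theta . (neighbourhood of that pixel)."""
    return (patches(x, k) @ theta).reshape(x.shape)


x = np.arange(12.0).reshape(3, 4)

omega = np.zeros(9)
omega[4] = 1.0                 # weight on the centre pixel only: identity map
same = transform(x, omega)

shift = np.zeros(9)
shift[3] = 1.0                 # weight on the left neighbour: shift right by one pixel
shifted = transform(x, shift)
```

Because the output is a matrix-vector product in both `x` and `theta`, this operator is linear in each argument, and a one-pixel translation corresponds to moving the unit weight within `theta`.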




Fig. 2. Illustration of local pixel transformation from the left image to the right 



Although this defines a limited class of image transformations, it obviously 
enhances the image modeling capabilities of weighted image combinations (which 
are only based on adjusting the brightness level of source images). Many useful 
types of transformation such as translation, rotation and blurring can be ap- 
proximated using this simple local transformation. Figure 3 shows that similar 
images can be much better reconstructed by simple filter transformations rather 
than merely adjusting brightness levels prior to summing. Here minor trans- 
lations and appearance changes can be adequately modeled in circumstances 
where brightness changes fail. 

3 Local Tangent Space Modeling 

The key to our proposal is to model the local tangent space around 
high-dimensional data points by a small number of transformations that locally 
preserve membership in the manifold. Thus, in our approach, a manifold is locally 
characterized by the invariants it preserves. 

Fig. 3. Least squares reconstructions of a target image (far right) from three nearby 
images (left). The intermediate (fourth) image shows the best least squares 
reconstruction of the rightmost image from the three leftmost images. First row: standard 
reconstruction. Second row: reconstruction after local transformations. 

We model transformations over the data space by using an operator $T(x, \theta)$ 
which combines a data object $x$ and a parameter vector $\theta$ to produce a 
transformed object $\tilde{x} = T(x, \theta)$. In general, we will need to assume very little about 
this operator, but, by making some very simple (and fairly weak) assumptions 
about the nature of $T$, we will be able to formulate natural geometric properties 
that one can preserve in a dimensionality reducing embedding. 

First, we assume that $T$ is a bilinear operator. That is, $T$ becomes a linear 
operator on each argument when the other argument is held fixed. Specifically, 

\[ T(a x_1 + b x_2, \theta) = a\,T(x_1, \theta) + b\,T(x_2, \theta), \]
\[ T(x, a\theta_1 + b\theta_2) = a\,T(x, \theta_1) + b\,T(x, \theta_2). \tag{2} \]

Second, we require the operator to have a local origin $\omega$ in the second argument 
that gives an identity map: 

\[ T(x, \omega) = x \quad \text{for all } x. \tag{3} \]

With these properties, we can then naturally equate parameterized 
transformations with tangent vectors as follows. First note that $T(x, \theta) = x + T(x, \delta)$ for 
$\delta = \theta - \omega$, since by bilinearity we have 

\[ T(x, \theta) = T(x, \omega + \delta) = T(x, \omega) + T(x, \delta) \]

and also 

\[ T(x, \omega) = x. \]

Thus, we can interpret every transformation of an object $x$ as a vector sum. 
That is, if $\tilde{x} = T(x, \theta)$ then the difference $\tilde{x} - x$ is just $T(x, \delta)$. 

Now imagine transforming a source object $x_i$ to approximate a nearby target 
object $x_j$, where both reside on the manifold. The best approximation of $x_j$ by 
$x_i$ is given by 

\[ \hat{x}_{ij} = T(x_i, \theta_{ij}), \]

where 

\[ \theta_{ij} = \arg\min_\theta\, \| x_j - T(x_i, \theta) \|. \]

If the approximation error is small, we can claim that the difference vector 
$\hat{x}_{ij} - x_i = T(x_i, \delta_{ij})$, for $\delta_{ij} = \theta_{ij} - \omega$, is approximately tangent to the manifold at 
$x_i$. One thing we would like to preserve is the transformation distance between 
nearby points. Consider the norm of the difference vector: 

\[ \| x_i - \hat{x}_{ij} \| = \| T(x_i, \delta_{ij}) \| = \| \delta_{ij} \| \, \| T(x_i, \eta_{ij}) \|, \]

where $\eta_{ij} = \delta_{ij} / \| \delta_{ij} \|$. Here $T(x_i, \eta_{ij})$ gives the direction of the approximate 
tangent vector at $x_i$, and $\| \delta_{ij} \|$ gives the coefficient in direction $\eta_{ij}$. This says 
that $\hat{x}_{ij}$ is the projection of $x_j$ onto the tangent plane centered at $x_i$, since 
$\hat{x}_{ij} = x_i + \| \delta_{ij} \|\, T(x_i, \eta_{ij})$ is the best approximation of $x_j$ in the local tangent 
space of $x_i$. 

Intuitively, when we embed $x_i$ and $\hat{x}_{ij}$ in a lower dimensional space, say by 
a mapping $x_i \mapsto y_i$ and $\hat{x}_{ij} \mapsto y_{ij}$, we would like to preserve the coefficient: 

\[ \| y_i - y_{ij} \| \approx \| \delta_{ij} \|. \]

That is, in the lower-dimensional space, the vector $y_i - y_{ij}$ encodes the embedded 
direction of the transformation, $T(x_i, \eta_{ij})$, and the length $\| y_i - y_{ij} \|$ encodes the 
coefficient of the transformation, $\| \delta_{ij} \|$. 

4 Transformation-Invariant Embedding Algorithm 

Consider a set of $t$ vectors, $x_i$, of dimension $n$ sampled from an underlying 
manifold. If the manifold is smooth and locally invariant to natural transformations, 
we should be able to transform nearby points on the manifold to approximate 
each other. Therefore, in the low dimensional embedding we would like to 
preserve the ability to reconstruct points from their transformed neighbors. First, to 
identify the local neighborhood of each data point $x_i$, we compute the best 
point-to-point approximations using the local transformation operator described above 
(as opposed to just using Euclidean distances as proposed in LLE and Isomap). 
That is, given a target image $x_j$ and a source image $x_i$, the best approximation 
of $x_j$ from source $x_i$ is given by 

\[ \hat{x}_{ij} = T(x_i, \theta_{ij}), \]

where 

\[ \theta_{ij} = \arg\min_\theta\, \| x_j - T(x_i, \theta) \|. \]

Given these quantities, the neighborhood of an image $x_j$ can then be 
approximated by selecting the $K$ nearest neighbors $x_i$ according to the $K$ best 
approximations among the transformed reconstructions $\hat{x}_{ij}$. 
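
For the pixel-neighborhood operator of Section 2, $T(x_i, \theta)$ is linear in $\theta$, so the minimization reduces to ordinary least squares. The sketch below assumes a $3 \times 3$ zero-padded neighborhood; the names `patches`, `best_approximation` and `neighbours` are illustrative.

```python
import numpy as np


def patches(x, k=3):
    """Row i holds the zero-padded k-by-k neighbourhood of pixel i (an assumption)."""
    h = k // 2
    xp = np.pad(x, h)
    n1, n2 = x.shape
    cols = [xp[a:a + n1, b:b + n2].ravel() for a in range(k) for b in range(k)]
    return np.stack(cols, axis=1)


def best_approximation(x_i, x_j, k=3):
    """Least-squares theta_ij, the reconstruction T(x_i, theta_ij), and its error."""
    P = patches(x_i, k)
    theta, *_ = np.linalg.lstsq(P, x_j.ravel(), rcond=None)
    x_hat = (P @ theta).reshape(x_i.shape)
    return theta, x_hat, np.linalg.norm(x_j - x_hat)


def neighbours(images, j, K, k=3):
    """Indices of the K sources whose transformed versions best reconstruct image j."""
    errs = [(best_approximation(images[i], images[j], k)[2], i)
            for i in range(len(images)) if i != j]
    return [i for _, i in sorted(errs)[:K]]


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))
shifted = np.zeros_like(x)
shifted[:, 1:] = x[:, :-1]                 # x translated right by one pixel
other = rng.standard_normal((6, 6))        # an unrelated image
images = [x, shifted, other]

theta, x_hat, err = best_approximation(x, shifted)
nearest = neighbours(images, j=1, K=1)
```

A translated copy is reconstructed almost exactly from its source and is therefore chosen as the nearest neighbor, even though its Euclidean distance to the source may be large.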

Second, to characterize the structure of the local neighborhood, we re-express 
each data point $x_j$ in terms of its $K$ nearest reconstructions $\hat{x}_{ij}$. Consider a 
particular image $x_j$ with $K$ nearest neighbors $\hat{x}_{ij}$ and reconstruction weights 
$w_{ij}$. The reconstruction error can be written as: 

\[ e_j(w_j) = \Bigl\| x_j - \sum_{i=1}^{K} w_{ij}\, \hat{x}_{ij} \Bigr\|^2, \]

where $w_j$ is the vector of reconstruction weights for an image $x_j$ in terms of its 
neighbors. Note that each data point $x_j$ is reconstructed independently. That 
is, we can recover each set of weights separately by solving a system of $n$ linear 
equations in $K$ unknowns. This can be expressed in a standard matrix form 

\[ e_j(w_j) = \| X_j w_j - N_j w_j \|^2 = \| (X_j - N_j)\, w_j \|^2 = w_j^\top G_j w_j, \]

where $X_j$ is the matrix of columns $x_j$ repeated $K$ times, $N_j$ is the matrix of 
columns of $K$ nearest reconstructions $\hat{x}_{ij}$ of $x_j$, and $G_j = (X_j - N_j)^\top (X_j - N_j)$. 
Note that, as with LLE, we wish to preserve scale and translation invariance 
in the local manifold characterization, and therefore we impose the additional 
constraint that the reconstruction weights $w_j$ of each point $x_j$ from its 
transformed neighbors sum to one. That is, $\sum_i w_{ij} = 1$ for all $j$. The rationale for 
this constraint is that we would like the reconstruction weights to be invariant 
under the mapping from the neighborhood to the global manifold coordinates, 
which can be shown to hold if and only if all rows of the weight matrix sum to 
one [2]. Therefore, imposing the extra constraint ensures that the reconstruction 
holds equally well in both high dimensional and low dimensional spaces. To show 
that the resulting constrained least squares problem can still be solved in closed 
form, introduce a Lagrange multiplier $\lambda$ and let $e$ be a column vector of ones, 
obtaining 

\[ L(w, \lambda) = w^\top G w + \lambda\,(w^\top e - 1), \]
\[ \frac{\partial L}{\partial w} = 2 G w + \lambda e = 0, \]
\[ G w = C e. \]

In practice, we can solve this with $C$ set arbitrarily to 1 and then rescale so $w$ 
sums to 1. 
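
The closed-form solve can be sketched as follows. The small ridge term `reg` is an added assumption (it keeps $G$ invertible when the neighbors are linearly dependent) and is not part of the derivation above; the function name is illustrative.

```python
import numpy as np


def reconstruction_weights(x_j, N_j, reg=1e-6):
    """Weights w (summing to 1) that minimise ||x_j - N_j w||^2.

    x_j: vector of length n. N_j: n-by-K matrix whose columns are the K
    reconstructions of x_j. reg is an ASSUMED ridge term for numerical stability.
    """
    D = x_j[:, None] - N_j                       # X_j - N_j
    G = D.T @ D
    G = G + reg * np.eye(G.shape[0])
    w = np.linalg.solve(G, np.ones(G.shape[0]))  # solve G w = C e with C = 1
    return w / w.sum()                           # rescale so the weights sum to 1


# Two neighbours at (0, 0) and (2, 0); the target (0.5, 0) lies between them.
N_j = np.array([[0.0, 2.0],
                [0.0, 0.0]])
x_j = np.array([0.5, 0.0])
w = reconstruction_weights(x_j, N_j)
```

Here the target sits a quarter of the way from the first neighbor to the second, so the weights come out close to 0.75 and 0.25.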

Finally, we need to embed the original points $x_j$ in the lower dimensional 
coordinate system by assigning them coordinates $y_j$. Here we follow the same 
approach as LLE and choose the $d$ dimensional vectors $y_j$ to minimize the 
embedding cost function 

\[ \Phi(Y) = \sum_{j=1}^{t} \Bigl\| y_j - \sum_{i=1}^{K} w_{ij}\, y_i \Bigr\|^2. \]

This ensures that we maintain the reconstruction ability in the coordinate system 
of the lower dimensional manifold. To solve for these coordinates, re-express the 
cost function in a standard matrix form 

\[ \Phi(Y) = \sum_{j=1}^{t} \| Y I_j - Y w_j \|^2, \]

where $I_j$ is the $j$th column of the identity matrix, and $w_j$ is the $j$th column of 
$W$. Then we obtain 

\[ \min_Y \sum_{j=1}^{t} \| Y I_j - Y w_j \|^2 = \min_Y\, \operatorname{trace}(Y M Y^\top), \]

where $M = (I - W)^\top (I - W)$. As observed in [2] the solution for $Y$ can have 
an arbitrary origin and orientation, and thus to make the problem well-posed, 
these two degrees of freedom must be removed. Requiring the coordinates to be 
centered on the origin ($\sum_j y_j = 0$), and constraining the embedding vectors to 
have unit covariance ($\frac{1}{N-1} Y Y^\top = I$) removes the first and second degrees of 
freedom respectively. So the cost function must be optimized subject to additional 
constraints. Considering only the second constraint for the time being, we find 
that 

\[ L(Y, \lambda) = Y M Y^\top + \lambda\,\bigl(Y Y^\top - (N - 1) I\bigr), \]
\[ \frac{\partial L}{\partial Y} = 2 M Y^\top + 2 \lambda Y^\top = 0, \]
\[ M Y^\top = -\lambda Y^\top. \]

Thus $L$ is minimized when the columns of $Y^\top$ (rows of $Y$) are the eigenvectors 
associated with the lowest eigenvalues of $M$. Discarding the eigenvector associated 
with eigenvalue 0 satisfies the first constraint. 
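
The eigenvector step can be sketched as below. The sketch assumes that row $j$ of `W` holds the weights used to reconstruct point $j$ (so rows sum to one), which makes the constant vector an eigenvector of $M$ with eigenvalue 0; the ring-shaped example and the name `embed` are illustrative.

```python
import numpy as np


def embed(W, d):
    """d-dimensional coordinates (as a d-by-t matrix Y) from a t-by-t weight matrix W.

    ASSUMPTION: W[j, i] is the weight of point i in the reconstruction of point j,
    so every row of W sums to one.
    """
    t = W.shape[0]
    IW = np.eye(t) - W
    M = IW.T @ IW
    vals, vecs = np.linalg.eigh(M)         # eigenvalues in ascending order
    Y = vecs[:, 1:d + 1].T                 # skip the constant eigenvector (eigenvalue 0)
    return Y * np.sqrt(t - 1)              # scale so that Y Y^T = (t - 1) I


# Toy data: t points on a ring, each reconstructed as the mean of its two neighbours.
t = 8
W = np.zeros((t, t))
for j in range(t):
    W[j, (j - 1) % t] = 0.5
    W[j, (j + 1) % t] = 0.5

Y = embed(W, d=2)
```

For this ring the two retained eigenvectors place the points on a circle centered on the origin, so both constraints (zero mean and unit covariance up to the factor $t - 1$) are satisfied by construction.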



5 Experimental Results 

We present experimental results on face image data. The first two experiments 
attempt to illustrate the general advantages of the proposed technique, 
Transformation-Invariant Embedding (TIE), for discovering smooth manifolds, at least 
in simple image analysis problems. A subsequent experiment attempts to show 
some of the advantages of TIE in a face recognition setting. In all experiments 
we use the transformation operator on images (1) that was described in Section 2. 

Our first experiment is on translated versions of a single face image, as shown 
in Figure 4. Although the data set is high dimensional (the images are comprised 
of many pixels), there is clearly a one dimensional manifold that characterizes 
the image set. Figure 4 shows the result of running LLE and TIE on the original 
data set shown at the top. The results show that the 1-dimensional manifold 
discovered by LLE is inferior to that discovered by TIE, which had no problem 
tracking the vertical shift in the image set. 

We then conducted an experiment on a database of rotating face images.
Figure 5 shows the two-dimensional manifold discovered by LLE, whereas Figure 6
shows the two-dimensional manifold recovered by TIE. In both cases, the first
dimension (top) captured the rotation angle of the images, although once again
LLE’s result is not as good as TIE’s. Interestingly, TIE (and to a lesser extent
LLE) learned to distinguish frontal from profile views in its second dimension.




526 A. Ghodsi, J. Huang, and D. Schuurmans 




Fig. 4. Top: Original data. Middle: 1-dimensional manifold discovered by LLE.
Bottom: 1-dimensional manifold discovered by TIE. (Images are sorted by the
1-dimensional y-coordinate values assigned by LLE and TIE respectively.)









Fig. 5. Two-dimensional manifold discovered by LLE. Top two rows show first dimen- 
sion, bottom two rows show second dimension. 









Fig. 6. Two-dimensional manifold discovered by TIE. Top two rows show first dimen- 
sion, bottom two rows show second dimension. Note: first dimension captures rotation, 
whereas second captures frontal views versus side views. 



Transformation-Invariant Embedding for Image Analysis 527 



Fig. 7. 105 rotated face images of 7 subjects 



Finally, we conducted an experiment on a database of 105 face images of
7 subjects, which includes variations in both pose and lighting (see Figure 7).
The original data space was embedded into a three-dimensional subspace.
Figures 8 and 11 show the first dimension discovered by LLE
and TIE respectively. Similarly, Figures 9 and 12 show the second dimension for
LLE and TIE, and Figures 10 and 13 show the third dimension.

Note that for TIE the first (Figure 11) and second (Figure 12) dimensions
correspond to rotation and to frontal versus profile views, whereas TIE essentially
learned to distinguish faces in its third dimension (Figure 13). Here, two
individuals were confused by TIE, whereas the other subjects were separated very
well.

The corresponding results for LLE are clearly inferior in each case. Figures 8,
9 and 10 illustrate that LLE failed to discover smooth rotations, frontal versus
side views, and identity.

6 Conclusion 

In many image analysis problems, we know in advance that the data will incorporate
different types of transformations. We introduce a way to make standard
manifold learning methods such as LLE invariant to transformations in the input.
This is achieved by modeling the local tangent space around high-dimensional 
data points by a small number of transformations that locally preserve member- 
ship in the manifold. Thus, in our approach, a manifold is locally characterized 
by the invariants it preserves. 
Fig. 8. First dimension of the three-dimensional manifold discovered by LLE









Fig. 9. Second dimension of the three-dimensional manifold discovered by LLE 











Fig. 10. Third dimension of the three-dimensional manifold discovered by LLE 
Fig. 11. First dimension of the three-dimensional manifold discovered by TIE

Fig. 12. Second dimension of the three-dimensional manifold discovered by TIE




Fig. 13. Third dimension of the three-dimensional manifold discovered by TIE 
We model transformations over the data space by using a bilinear operator
which produces a transformed object, and show that by making this fairly
weak assumption about the nature of the operator, we are able to formulate
natural geometric properties that one can preserve in a dimensionality reducing
embedding.

Although our basic approach is general, we focused on the special case of
modeling manifolds in natural image data with emphasis on face recognition
data. Here the proposed simple local transformations capture natural invariants
in the image data better than simply averaging nearby images together. Although
we have focused solely on facial rotation and translation as the basic invariants
we have been attempting to capture, clearly other types of transformations, such
as warping and out-of-plane rotation, are further phenomena one may wish to
capture with these techniques.
Abstract. We analyze the least-squares error for structure from motion
(SFM) with a single infinitesimal motion (“structure from optical flow”).
We present approximations to the noiseless error over two,
complementary regions of motion estimates: roughly forward and non- 
forward translations. Experiments show that these capture the error’s 
detailed behavior over the entire motion range. They can be used to 
derive new error properties, including generalizations of the bas-relief 
ambiguity. As examples, we explain the error’s complexity for epipoles 
near the field of view; for planar scenes, we derive a new, double bas- 
relief ambiguity and prove the absence of local minima. For nonplanar 
scenes, our approximations simplify under reasonable assumptions. We 
show that our analysis applies even for large noise, and that the projec- 
tive error has less information for estimating motion than the calibrated 
error. Our results make possible a comprehensive error analysis of SFM. 



1 Introduction 

A structure-from-motion (SFM) algorithm has two tasks: matching the 3D fea- 
tures across different images, and estimating the camera motion and 3D struc- 
ture. This paper reports progress toward a comprehensive analysis of estimation. 

Under standard assumptions, the goal of an “optimal” estimation algorithm 
is to find the minimum of the least-squares image-reprojection error [8], and 
the shape of this error as a function of the estimates determines the intrinsic 
problem that the algorithm solves. Here, we analyze this shape for SFM with a 
single infinitesimal motion (“structure from optical flow”). 

Little is known about the least-squares error. Yet, without understanding 
it, one can’t predict when algorithms will succeed or fail — for instance, when 
bundle adjustment [24] will find the optimal least-squares estimate rather than 
a bad estimate at a false local minimum. Given some understanding, algorithms 
can avoid local minima and compute estimates more reliably, as shown in [18] [3]. 

Previous research on estimation (as opposed to geometry) in SFM focussed on
the bas-relief ambiguity [1][4][22][10][14][7][18][3][23][9][6]. Other results include
the proof in [3] that the error is singular when the epipole estimate coincides
with an image point, and a semi-quantitative description [18] of the error over a
linear slice through the plane of all epipole estimates. None of this work comes
close to giving a detailed picture of the least-squares error.

In this paper, we present approximations to the noiseless error over two, 
complementary regions of motion estimates: roughly forward and non-forward 
translations. Together, these approximations describe the whole error. They re- 
produce its detailed shape, yet are simple enough to be useful for understanding 
it. We believe that they make it possible to study the least-squares error in 
depth, and we illustrate this by deriving several new properties of the error. 

As in many previous analyses, e.g., [7] [14] [3], our theoretical discussion as- 
sumes infinitesimal motion and zero noise. Experiments show that the theory 
also works for large noise. We study calibrated cameras, taking the focal length 
as 1 without loss of generality, and also present results for projective SFM. For 
lack of space, all proofs are omitted. They can be found in [17]. 



1.1 Preliminaries 



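The standard least-squares error for infinitesimal motion (or optical flow) is [14]

$$F_{ls}(T, \omega, \{Z\}) = \sum_{m=1}^{N_p} \Big| d_m - Z_m^{-1} \big( T_z p_m - [T_x; T_y] \big) - \sum_{a \in \{x,y,z\}} \omega_a \, r^{(a)}(p_m) \Big|^2 \qquad (1)$$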



Here $N_p$ is the total number of scene points, $p_m \equiv p_{1m} = (x_m; y_m)$ is the $m$th
image point in the first image, $d_m = p_{2m} - p_{1m}$ is the $m$th measured flow
from image 1 to 2, the $Z_m$ are the 3D depth estimates, $T$ is the translation
estimate, $\omega = (\omega_x; \omega_y; \omega_z)$ is the estimate of the infinitesimal rotation, and
the $r^{(x)}(p)$, $r^{(y)}(p)$, $r^{(z)}(p)$ are the rotational flows at the image point $p$ due
to unit rotations around the x, y, or z axes:

$$r(p) = \left[ r^{(x)}(p), r^{(y)}(p), r^{(z)}(p) \right] = \begin{pmatrix} -xy & 1 + x^2 & -y \\ -(1 + y^2) & xy & x \end{pmatrix} \in \mathbb{R}^{2 \times 3} \qquad (2)$$



We study an effective error $E(e) = \min_{\{Z\},\omega} F_{ls}(T, \omega, \{Z\})$, with $e$ the epipole.

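Definition 1.

Define the cross-product for vectors $v, v' \in \mathbb{R}^2$ by $v \times v' = v_x v'_y - v_y v'_x$.

Define the error vector $\epsilon \in \mathbb{R}^{N_p}$ by

$$\epsilon_m(e) = \frac{p_m - e}{|p_m - e|} \times d_m. \qquad (3)$$

Define the 3 rotational contributions to $\epsilon$, $\Psi^{(a)}(e) \in \mathbb{R}^{N_p}$, $a \in \{x, y, z\}$: Let
$\Psi^{(a)}_m = ((p_m - e)/|p_m - e|) \times r^{(a)}(p_m)$ and $\Psi(e) = [\Psi^{(x)}, \Psi^{(y)}, \Psi^{(z)}] \in \mathbb{R}^{N_p \times 3}$.

Define the projection $\Pi(e) = 1_{N_p} - \Psi(e) \left( \Psi^T(e) \Psi(e) \right)^{-1} \Psi^T(e) \in \mathbb{R}^{N_p \times N_p}$,
where $1_{N_p}$ denotes the $N_p \times N_p$ identity matrix.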
Proposition 1. [19][20] Assume the candidate epipole $e$ does not coincide
with any image point $p_m$ (see the footnote). Then

$$E(e) = \min_{\{Z\},\omega} F_{ls}(T, \omega, \{Z\}) = \epsilon^T(e) \, \Pi(e) \, \epsilon(e). \qquad (4)$$

Remark 1. The definition of $\Pi(e)$ shows that it cancels the rotational
contributions to $\epsilon$. Thus, for noiseless data $E(\cdot)$ does not depend on the value of the
true rotation $\omega_{true}$, and we are free to take $\omega_{true} = 0$ in analyzing it. We do this
for the rest of the paper, without loss of generality.
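
The cancellation described in Remark 1 can be checked numerically. The sketch below is illustrative only: it assumes the flow convention $d_m = Z_m^{-1}(T_z p_m - [T_x; T_y]) + r(p_m)\,\omega$ and the standard rotational-flow matrix, and the helper names (`rotational_flow`, `synth_flow`, `effective_error`) are hypothetical. It evaluates $E(e)$ as the squared residual of $\epsilon$ after projecting out the span of the rotational contributions.

```python
import numpy as np

def rotational_flow(p):
    """2x3 matrix of image flows at p = (x, y) due to unit rotations about x, y, z."""
    x, y = p
    return np.array([[-x * y, 1 + x**2, -y],
                     [-(1 + y**2), x * y, x]])

def cross2(u, v):
    """Scalar 2D cross product u_x v_y - u_y v_x, applied row-wise."""
    return u[..., 0] * v[..., 1] - u[..., 1] * v[..., 0]

def synth_flow(pts, inv_depth, T, omega):
    """Noiseless flow under the assumed convention (translation plus rotation)."""
    trans = inv_depth[:, None] * (T[2] * pts - T[:2])
    rot = np.stack([rotational_flow(p) @ omega for p in pts])
    return trans + rot

def effective_error(e, pts, flows):
    """E(e) = eps^T Pi eps: residual of eps after removing rotational directions."""
    dirs = pts - e
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    eps = cross2(dirs, flows)
    R = np.stack([rotational_flow(p) for p in pts])            # shape (N, 2, 3)
    Psi = np.stack([cross2(dirs, R[:, :, a]) for a in range(3)], axis=1)
    # Least squares over omega is equivalent to applying the projection Pi.
    omega_hat, *_ = np.linalg.lstsq(Psi, eps, rcond=None)
    resid = eps - Psi @ omega_hat
    return float(resid @ resid)

rng = np.random.default_rng(0)
pts = rng.uniform(-0.3, 0.3, size=(30, 2))
inv_depth = rng.uniform(0.5, 1.5, size=30)
T = np.array([0.2, -0.1, 1.0])
e_true = T[:2] / T[2]
flows_rot = synth_flow(pts, inv_depth, T, np.array([0.01, -0.02, 0.03]))
flows_norot = synth_flow(pts, inv_depth, T, np.zeros(3))
e_test = e_true + np.array([0.3, -0.2])
```

Under these assumptions the error vanishes at the true epipole, and its value at any other candidate epipole is the same whether or not the synthetic flow includes a rotation.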

For noiseless images, we get a more explicit expression for E (e) by sub- 
stituting the ground truth for the flow into the result of Proposition 1: 

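Proposition 2. Assume $\forall m$, $e \neq p_m$, as in Proposition 1. Then

$$E(e) = \epsilon^T(e) \, \Pi(e) \, \epsilon(e), \qquad (5)$$

$$\epsilon_m(e) = T_{true,z} \, Z_m^{-1} \, \hat{p}_{m,e}^{\perp} \cdot (e - e_{true}), \qquad (6)$$

where the $Z_m$ are the true depths, and $p_{m,e} \equiv p_m - e$, $\hat{p}_{m,e} \equiv (p_m - e)/|p_m - e|$, $v^{\perp} \equiv (-v_y; v_x)$.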

Remark 2. Each summand in (5) is proportional to $|e - e_{true}|^2$, so $E(e)$ is
continuous at $e = e_{true}$. This gives a direct proof of the result of [3].
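
The dependence on $e - e_{true}$ noted in Remark 2 follows from a short calculation. The sketch below assumes the flow convention of (1) with $\omega_{true} = 0$, so that the noiseless flow is $d_m = T_{true,z} Z_m^{-1} (p_m - e_{true})$ with $e_{true} = (T_x; T_y)/T_z$ and $Z_m$ the true depths; it is offered as a plausible reading, not a quotation of the proof in [17].

```latex
\begin{aligned}
\epsilon_m(e)
  &= \hat{p}_{m,e} \times d_m
   = T_{true,z}\, Z_m^{-1}\; \hat{p}_{m,e} \times \big[ (p_m - e) + (e - e_{true}) \big] \\
  &= T_{true,z}\, Z_m^{-1}\; \hat{p}_{m,e} \times (e - e_{true})
   \qquad \text{since } \hat{p}_{m,e} \times (p_m - e) = 0 \\
  &= T_{true,z}\, Z_m^{-1}\; \hat{p}_{m,e}^{\perp} \cdot (e - e_{true}),
   \qquad v^{\perp} \equiv (-v_y;\, v_x).
\end{aligned}
```

Every entry of $\epsilon$ is therefore linear in $e - e_{true}$ and bounded in magnitude by $|T_{true,z} Z_m^{-1}|\,|e - e_{true}|$, so the quadratic form $\epsilon^T \Pi \epsilon$ tends to zero as $e \to e_{true}$.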

2 Forward Motion: e in or near the Image 

We first analyze $E(e)$ for candidate epipoles in or near the image, with $|e| <
\theta_{FOV}/2$ radians, where $\theta_{FOV}$ gives the angular extent of the image points. We
refer to this as the forward region. The true epipole is not constrained.

Previous results. For $e$ near the image points, [18][3] show that $E(e)$ typically
is complex and has local minima. Also, [3] proved: $E(e)$ is singular when $e$
coincides with an image point and $e \neq e_{true}$; $E(e)$ is continuous at $e = e_{true}$.
The singularity is not enough to explain the minima: The error can be singular
at an image point and yet behave smoothly a short distance away (Figure 1d).
To explain them, one must understand what causes the singular effects to extend
far from the image points, so that effects from different points can interact.

To state this another way, the error’s singularity at an image point reflects
the known $\sin^2 \theta$ dependence on the angle between the hypothesized epipolar
direction and the observed translational flow. Thus, the singularity at the image
point comes from a known property of the error around an image point and does
not give a new explanation of the error’s behavior around the point.

^ If it does, the expression for $\min_{\{Z\},\omega} F_{ls}(T, \omega, \{Z\})$ must be modified slightly.
Strictly, our formulas in $e$ aren’t valid at $T_{true,z} = 0$, but they are easily extended.
Fig. 1. Contour plot of error for $e$ in the field of view. The structure comes from PUMA
[12] and $e_{true} = (0.16; 0.37)$. The 3D depths are shown at the image points. (a): $E(e)$,
with 4 minima at ‘x’; (b): Projective error, with 4 marked minima; (c): $E(e)$ for noisy
images, with $\sigma_{noise} = 0.2\, d_{med}$ (see Fig. 7); (d): Closeup of $E(e)$ around an image point,
showing that the singularity quickly becomes invisible.



2.1 Forward Analysis 

Remark 3. The singularity of $E(e)$ at an image point causes it to have two
local minima on an infinitesimal circle around the point. In a region where $E(e)$
behaves smoothly, it has a single minimum on an infinitesimal circle. Thus, we
analyze $E(e)$, and its minima, on small circles centered on the image points. Let
$\rho_k$ be the radius of the circle $C_k$ around $p_k$. A particular limit turns out to give
a useful approximation (Linear in Proposition 3).



Definition 2 (Near-point limit). Define the limit $\rho_k \to 0$ by ($\rho_k \to 0$,
$N_p \to \infty$ for fixed $\rho_k N_p$ and $\theta_{FOV}$), where we stipulate that image-point
sums $O(N_p) \to \infty$.
Remark 4. For real images, the stipulation amounts to assuming that sums over
the image points have no unexpected cancellations. This holds unless the image
points cluster near a line or the 3D depths have a few outliers at small depths.

Proposition 3 (Linear). In the limit $\rho_k \to 0$, we have the asymptotic estimate
E (e) ~ Linear (^) O {^Pk: Up ) 071 Ck, whcTC Linear (®) = ^ |Pfc ®true | ^ 

cNp + {pkNp) 

(7) 

The 0(1) c G 3 ?, a, a„G Q G 3 ?^^^ don’t depend on e, jp^ — Otmel; PkNp- 

We rewrite our approximation as $\mathrm{Linear} = \gamma + \alpha \cos(\theta - \phi_1) + \beta \cos^2(\theta - \phi_2)$,
where $\alpha$, $\beta$ give the linear and quadratic terms in (7), and $(\cos\theta; \sin\theta) = \hat{p}_{k,e}$.

Lemma 1. Let $f(\theta) = a \cos^2(\theta - \phi_1) + \cos(\theta - \phi_2)$. For any values of $\phi_1$, $\phi_2$,
the function $f(\theta)$ has one minimum for $|a| < 1/2$ and two minima for $|a| > 1$.

Thus, the value of $|\beta/\alpha|$ determines how many minima Linear has on the
circle $C_k$ and, from Remark 3, the rate of decrease of $|\beta/\alpha|$ with $\rho_k$ (i.e., with
the distance from $p_k$) determines how far the singular effects due to $p_k$ extend.
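
Lemma 1 is easy to probe numerically. The following sketch (the function name `count_minima` and the grid resolution are illustrative choices, not from the paper) samples $f(\theta)$ on a fine periodic grid and counts its local minima for a few values of $a$ on either side of the interval $1/2 \le |a| \le 1$.

```python
import numpy as np

def count_minima(a, phi1, phi2, n=20000):
    """Count local minima of f(theta) = a*cos^2(theta - phi1) + cos(theta - phi2)
    on the circle, using a uniform periodic grid of n samples."""
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    f = a * np.cos(theta - phi1) ** 2 + np.cos(theta - phi2)
    prev_f, next_f = np.roll(f, 1), np.roll(f, -1)
    # Strict on one side, non-strict on the other, so a tie is counted once.
    return int(np.sum((f < prev_f) & (f <= next_f)))

phases = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
small_a = {count_minima(a, p1, p2) for a in (0.3, -0.4) for p1 in phases for p2 in phases}
large_a = {count_minima(a, p1, p2) for a in (1.5, -3.0) for p1 in phases for p2 in phases}
```

For the sampled phases, `small_a` contains only the count 1 and `large_a` only the count 2, in line with the lemma; values of $|a|$ between 1/2 and 1 can give either count depending on the phases.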

Experiments. We compared Linear with the true error’s behavior for 1200 synthetic
flows generated from real structures. We measured the singularity of $E(e)$
on a circle by: the number of its local minima, and the ratio of its second
fundamental (3rd Fourier coefficient) to its standard deviation. This second measure
indicates the singularity’s size. Figures 2a,b verify that all but a small
fraction (3%) of the one-minimum results have $|\beta/\alpha| < 1$, and all but 1.7% of the
two-minimum results have $|\beta/\alpha| > 1/2$. Figure 2c shows that the “size” of the
singularity grows roughly linearly with $|\beta/\alpha|$ until it saturates. These results
demonstrate that our analysis predicts the error’s behavior very well.

One can use Linear to understand the factors causing the error’s complexity
[17]. Figure 3 confirms our predictions from (7): the error behaves smoothly
near image points close to $e_{true}$ (Fig. 3a), and is more likely to have a complex
behavior near an isolated image point (Fig. 3b) or one with extreme 3D depth
(Fig. 3c). Also, experiments show that the fraction of “singular” results decreases 
roughly like , and the size of the singular fluctuations in the error decreases 
roughly like N~^, in agreement with the behavior of |/3 /q;|, see [17]. 

3 Sideways Motion: $|e| > \theta_{FOV}/2$

Preliminaries. Define $A \equiv T_{true,z} (\hat{e} \times e_{true})$, $B \equiv T_{true,z} (1 - \hat{e} \cdot e_{true}/|e|)$,
with $\hat{e} \equiv e/|e|$. The $A$, $B$ capture all dependence on the true translation $T_{true}$.
It is convenient to use an image coordinate system that rotates around the
image center with the candidate epipole $e$ such that $e = (|e|; 0)$.
Fig. 2. Histograms of two measures of the complexity of $E(e)$. (a) $|\beta/\alpha|$, one-minimum
results; (b) Two-minima results, separately for $|\beta/\alpha| < 2$ and $|\beta/\alpha| > 2$. (c) Normalized
second fundamental of $E(e)$ for $|\beta/\alpha| < 5$.




Fig. 3. Histograms, plotted separately for circles Ct where -E(e) had two minima (dot- 
ted curves) and one minimum, (a) Epipolar-distance ratio (A'pPfc)~^|pfc — etrue|; (b) Iso- 
lation measure pk Ylmjtk IP"* “ (c) 3D depth ratio |.Zfcdtrue|/ max,„ |.Zmdtrue|, 

where Zk = - rfel7) and dtrue = (pfc - etrue)/|pfe - etrue| 



We represent the inverse depths as a sum of a linear component and a
nonlinear component. We write $Z_m^{-1} = n_z + n_x x_m + n_y y_m + Z_{NL,m}^{-1}$, where $Z_{NL,m}^{-1}$
is the nonlinear and $Z_{L,m}^{-1} = n_z + n_x x_m + n_y y_m$ is the linear component of the
structure, and where we define these components uniquely from

$$0 = \sum_{m} Z_{NL,m}^{-1} = \sum_{m} x_m Z_{NL,m}^{-1} = \sum_{m} y_m Z_{NL,m}^{-1}. \qquad (8)$$
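
One way to read the uniqueness conditions in (8) is as the normal equations of an ordinary least-squares fit of a plane to the inverse depths: the residual of that fit is orthogonal to the constant, $x$, and $y$ regressors. The sketch below illustrates this reading; the function name `split_inverse_depth` is hypothetical and the data are synthetic.

```python
import numpy as np

def split_inverse_depth(pts, inv_depth):
    """Split inverse depths into a planar (linear) part and a nonlinear residual.

    Returns ((n_x, n_y, n_z), linear_part, nonlinear_part), where the nonlinear
    part is orthogonal to the vectors {1}, {x_m}, {y_m}.
    """
    design = np.column_stack([np.ones(len(pts)), pts[:, 0], pts[:, 1]])
    coef, *_ = np.linalg.lstsq(design, inv_depth, rcond=None)
    linear = design @ coef
    n_z, n_x, n_y = coef
    return (n_x, n_y, n_z), linear, inv_depth - linear

rng = np.random.default_rng(1)
pts = rng.uniform(-0.5, 0.5, size=(50, 2))
inv_depth = rng.uniform(0.5, 2.0, size=50)
normal, lin_part, nl_part = split_inverse_depth(pts, inv_depth)

# A planar scene has no nonlinear component and returns its own normal.
planar = 1.0 + 0.3 * pts[:, 0] - 0.2 * pts[:, 1]
normal_planar, _, nl_planar = split_inverse_depth(pts, planar)
```

With this construction a planar scene is reproduced exactly by the linear part, which matches the interpretation of $(n_x; n_y; n_z)$ as a planar normal.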
We refer to $\mathbf{n} = (n_x; n_y; n_z)$ above as the planar normal, since $Z_m^{-1} = n_z +
n_x x_m + n_y y_m$ for a planar scene neglecting noise. We define the planar epipole
$n \equiv (n_x; n_y)/n_z$ by analogy with the epipole $e$, and $\tilde{n}_z \equiv n_z - n_x/|e|$.

Definition 3 (Limit of zero field of view (FOV)). Let $\theta_{FOV}$ be the angular
extent of the region spanned by the image points. We define the zero-FOV limit
by writing the image points as $p_m \equiv \Delta_{FOV}\, p'_m$ and taking $\Delta_{FOV} \to 0$ keeping
the $p'_m$ and $Z_m$ fixed. We denote the limit by $\theta_{FOV} \to 0$ or $\Delta_{FOV} \to 0$.

The classical result is on the bas-relief ambiguity [7][14][15].

Theorem 1 (Jepson/Heeger/Maybank (JHM)). Assume the image points
do not lie on a line, and that $e$ is finite and $|e| > 0$. In the limit of zero field
of view, the noiseless least-squares error for infinitesimal motion is given by
$E(e) = T_{true,z}^2 \, (\hat{e} \times e_{true})^2 \sum_{m=1}^{N_p} Z_{NL,m}^{-2}$.



Remark 5 (Limitations of the JHM Theorem).

The JHM result models the error only for $\theta_{FOV} \ll |e| \ll \theta_{FOV}^{-1}$ and does not
capture any of the error’s dependence on $e \equiv |e|$ (it cannot be used to analyze
the minima); it gives no information about the error on the line $e = t\, e_{true}$, $t \in
(-\infty, \infty)$, despite the fact that the true epipole lies on it; it says nothing about
the error for $|e_{true}| \sim O(\theta_{FOV})$; it says nothing about the error for planar scenes
or the effect on $E$ from the linear scene component, which is always important.



3.1 New Analysis 

Definition 4 (Sideways limit). The sideways limit $e \to \infty$ is defined by

($\Delta_{FOV} \to 0$, $e \to \infty$ for fixed $e\, \Delta_{FOV}$, $p'_m$, $A$, $B$, $\tilde{n}_z$, $n_y$, $Z_{NL,m}^{-1}$),

where the zero-FOV limit $\Delta_{FOV} \to 0$ and the $p'_m$ are given in Definition 3.



Theorem 2 (Main Theorem). The approximation $E_{side}(e)$ in (13) gives an
asymptotic estimate of $E$ in the sideways limit:

$$E(e) - E_{side}(e) = O(E/e) \quad (e \to \infty). \qquad (9)$$



Remark 6. The sideways limit fixes $\tilde{n}_z$ and $B = T_{true,z} (1 - e_{true,x}/e)$, which
depend on $e$. We do this since we want $E_{side}(e)$ to remain a good approximation
when $|n|$ and $|e_{true}|$ are as large as $e$, and since this simplifies our approximation
and makes it display the two-fold ambiguity of SFM for planes [14].
To derive $E_{side}(e)$, we neglect effects suppressed by factors of $\theta_{FOV}$ and
$\theta_{FOV}/|e|$. First, we must “pre-subtract” the leading rotational contribution from
$\epsilon$ in (3). This is necessary since $E(\cdot)$ is given by a subtraction of two terms (due
to the rotation cancellation from $\Pi$) and we need to ensure that its leading
dependence comes from the leading dependencies of the individual terms. After
this “pre-subtraction,” $\Pi$ must be replaced by $\Pi_\perp$, where the latter annihilates
the remaining subspace of rotational contributions. Define

A-“-> {|^ (!/* - <!/*»} . (10) 



ry(l,0 

= 



Ip-e 






77 



(side) 



Ip-e 






(12) 



where $\Pi_\perp^{(side)}$ equals $\Pi_\perp$ evaluated in the sideways limit, $\{D\}$ denotes a vector
in $\mathbb{R}^{N_p}$ with entries $D_m$, $\alpha \equiv A n_y - B \tilde{n}_z$, $\beta \equiv B n_y + A \tilde{n}_z$. Then

Aside = ~ (13) 

-2APZ^°'^'^ + 2AaZ^°'^^ + - 2AaZj)’^^ 

[17] gives explicit formulas for the , Z^^ , and Z'^^ in terms of the image 



and structure moments 

^ia,b = Ip™ “ ®l' 



— C ^ ] Z^i^ zn^mym/ 



- e 



Sa,b — e 5^ / IP™ ®l ' 



Discussion. Our result nicely separates the dependencies on the various
parameters. For example, $E_{side}$ depends on $e_{true}$ only through the quantities $A$ and
$B$. It depends on $e$ just through $B$, $\tilde{n}_z$, and the dot products appearing in (13),
where $B$ and $\tilde{n}_z$ are linear in $e^{-1}$ and the dot products can be approximated
by simple ratios of quadratic expressions in $e$. One can easily read off from
our formulas which contributions dominate at small FOV. For planar scenes (or 
the linear scene contribution), all the structure/motion unknowns appear in the 
leading factors; the A^ depend only on the known image coordinates. Aside can 
be shown to respect the planar two-fold ambiguity. 

Our result depends on the nonlinear part of the scene through the structure
moments $\sigma_{a,b}$ for $2 \le a + b \le 3$, and on $S_{c,d}$. Thus, the error’s dependence on
the scene can be approximated using 15 parameters to describe the scene. Just 6
are usually enough. Our expression for $E_{side}$ often simplifies dramatically. This
is because our approximation works for many types of scenes and motions, and
we can often neglect most of the terms for a particular scene/motion.
Fig. 4. Sideways error. (1): Planar example with double bas-relief ambiguity,
$e_{true} = (-6.9; 7.2)$, $n = (0.62; 0.64)$; (2): Planar example showing lack of minima,
$e_{true} = (-0.69; 0.72)$, $n = (0.81; 0.74)$; (3): Rocket structure [5], $e_{true} = (-0.14; -0.045)$.
(a): True $E(e)$; (b): Simple planar approximation; (3b): Simple approx. (15).



3.2 Some Examples of Consequences 

Planar Scene, Non-forward true motion; large planar slant. Assume an image
pair with $\theta_{FOV} \ll 1$ (small FOV), $|e_{true}| \gg 1$ (sideways true motion),
and $|n| = |(n_x; n_y)/n_z| \gg 1$ (large slant). We assume $e \gg 1$, excluding $e \ll
|e_{true}|$ and $e \ll |n|$ (large-$e$ assumption). Then $E(e) \approx \kappa \, T_{true,z}^2 \, (\hat{e} \times e_{true})^2 (\hat{e} \times n)^2$,
where $\kappa$ is roughly constant. The error is small on two lines, giving
a double bas-relief ambiguity. Figure 4(1) illustrates this new effect.

Planar Scene, Non-sideways true motion; small planar slant. Assume $\theta_{FOV} \ll 1$
(small FOV), $1 \ll e$ (large $e$), $|e_{true}| \ll 1$ (forward true motion), and $|n| \ll
1$ (small slant). Under these conditions, our approximation has no false local
minima in a region $e > e_{thresh}$, where $e_{thresh} \sim O(1)$.

One can show that the derivative with respect to e of our asymptotic estimate 
gives an asymptotic estimate of the derivative of E{e). Thus, in the sideways 
limit E (e) has no false local minima for sufficiently large e. Figure 4(2) shows 
an example, comparing the true error to our simple approximation above. 

Symmetry. The image moments $\mu_{a,b}$ divide into two categories: the even
moments, such as $\mu_{2,0}$, $\mu_{0,2}$, $\mu_{2,2}$, that involve sums over even powers and
nonnegative terms only, and the odd moments. For randomly distributed image points,
the odd $\mu_{a,b}$ are suppressed by roughly $1/\sqrt{N_p}$ compared to the even $\mu_{a,b}$.
With many correspondences, the image usually has some rotational symmetry
and we can neglect the odd $\mu_{a,b}$ to a good approximation. Then, $E \approx E_{side} \approx$



2 e P- 0 , 2 ^ 2,2 

~ ^ ^ I 

e^M2.2 + Mo.2 

— 2A f/3(7o^2 ~f~ OL 



Eo,2 









(14) 



6 ^ 0,2 



e fj^2,2 + Mo, 2 



-l- j4^S'o,o + Sq^2 ~ 2Ai?S'o,i ~ 



CTyi ] +2B [ /3(To,3 - a 



e M 2 , 2 CTo ,2 — eMo, 2 CTi ,2 



e^M2,2 + Mo, 2 



(Aeayi - g (ecri ,2 + ^0,2))^ 
e^M2,2 + Mo,2 



- S 



2 ^1,1 
M2,0 



Also, symmetry makes the even $\mu_{a,b}$ depend weakly on the epipolar direction $\hat{e}$
[17], which gives a further simplification.

In the same way as for the $\mu_{a,b}$, we can estimate the relative sizes of the
structure-dependent moments $S_{0,a}$ and $\sigma_{0,b}$. All the $\sigma_{a,b}$ will be small, for any
direction $\hat{e}$, if the $Z_{NL,m}^{-1}$ have no good approximation in terms of a cubic
polynomial (noncubic condition). Also, [17] argues that the mixed terms combining
the nonlinear and linear structure components can often be neglected. Assuming
this and the noncubic condition, we get the simple estimate
E{e) « A,ide(e) « - 2a/3ig’') + A^^o.o + B^So, 2 - (15) 



Our experiments show that (15) accurately describes the error for our sequences. 

In addition to the conclusions above, [17] uses our estimate to generalize the 
JHM theorem [7] [14] and to extend the results of [18] to planar scenes. 



Experiments. We tested $E_{side}$ against the true error, using synthetic structures
and structures extracted from five real sequences (PUMA [12], Rocket [5], CMU
CASTLE, and two of our indoor sequences). We show only a few results.

Figure 5 compares $E(e)$ to our simplest approximation (15), which has slight
problems only for the PUMA example. Figure 5 also shows $E_{side}$ for this example;
it is indistinguishable from the true error. For the Rocket structure, we compared
the global minimum positions for the true error, $E_{side}$, and (15). Within
measurement error, they were identical. Fig. 6(1,2) shows that our symmetry-based
approximation (14) gives good results with just 192 and 132 image points.



4 Projective Geometry 

Suppose one fixes the camera matrix for image 1 to be $(1_3, 0_3)$. The projective
transforms that leave this camera matrix unaltered change the structure by
adding an arbitrary plane to it or scaling it [16]. The real projective ambiguity
is just this freedom to scale or add a plane. Since scenes that differ only in their
planar component are equivalent in projective geometry, the linear component
$Z_L^{-1}$ of the scene cannot make a contribution to the least-squares error. In effect,
only the nonlinear terms contribute to $E_{side}$.
Fig. 5. (1): CMU Castle, $e_{true} = (5.78; 8.16)$; (2): Indoor 1, $e_{true} = (0.11; 0.16)$;
(3): Indoor 2, $e_{true} = (-0.125; -1.49)$; (4): PUMA, $e_{true} = (10; -0.025)$.

(a): Simplified approximation (15); (b): True error; (c): $E_{side}$.



[17] shows this directly. For projective SFM and infinitesimal motion, one can
define a projective error $E_{proj}(e) = \epsilon^T \Pi_{proj}\, \epsilon$ as for the Euclidean case, where $\epsilon$
is the same as in (3). The same arguments as before give a sideways asymptotic
estimate. Ep^oj (c?) ss A 7Tj_pi.oj 

— 2AB {Znl}^ 77j_proj {2 /^Nl} + ^proj {y-^Np} • (16) 

Thus, the error for projective SFM is simpler than the Euclidean error. This
simplicity comes at a cost [18]. At large $e$, one can show that the projective
error on the line $e = t\, e_{true}$ has no quadratic growth with $e$ as in the Euclidean
error. This implies that the projective error gives less information to estimate
the epipole. Figure 6(3B) compares the Euclidean and projective errors for the
same image pair.

Fig. 6. (1): Extended PUMA sequence, $e_{true} = (0.13; -0.08)$; (2): Extended Rocket,
$e_{true} = (-1.0; -3.9)$; (3): Projective error for PUMA, same images as in Figure 5(4).
(a): Approximation (14); (b): True error; (3A): True projective error; (3B): Projective 
(dashed) and Euclidean errors on the line = 0, the “bas-relief valley” in (3A). 



In the forward region, the projective analysis is similar to the Euclidean one. 
Experiments confirm that the results are also similar, see Figure 1. 

5 Noise 

We report experiments on noisy images. We ran a standard two-image algorithm
to estimate the structure/motion and used the result to compute $E_{side}$ (see the
footnote). $E_{side}$ continues to model the true error well, despite our using a larger
than normal noise. The noise is large enough that our two-image routine usually
returns bad $T$ estimates and the noisy error looks quite different from the noiseless one.

For noisy images, we cannot assume without loss of generality that the true 
rotation is zero. Fortunately, rotation has a small effect on the error [18] [25]. 

We have not studied the forward noisy error carefully, but experiments (e.g.. 
Figure 1) indicate that noise increases its complexity, as might be expected [17]. 

6 Conclusion 

We studied the least-squares error for infinitesimal motion, giving two simple 
asymptotic estimates of the error which capture its detailed behavior over the 

^ Our rationale is that the error depends on the observed flow, which is modelled 
better by the estimated structure and motion than by the ground truth. 
Fig. 7. Noise results. $\sigma_{noise}$ gives the noise standard deviation, $d_{med}$ the median size
of the true flow, and $\theta_{Terr}$ the angular error in the initial $T$ estimate. (1): Rocket,
$e_{true} = (-0.01; -0.05)$, $\sigma_{noise} = 0.4\, d_{med}$, $\theta_{Terr} = 77°$. (2): PUMA, $e_{true} = (0.5; -0.2)$,
$\sigma_{noise} = 0.5\, d_{med}$, $\theta_{Terr} = 38°$. (a) Noiseless error; (b) True noisy error; (c) $E_{side}$.



entire range of motions. We illustrated the use of these estimates by deriving 
new error properties. 

For roughly forward translation estimates, we showed by theory and exper- 
iment that the error tends to be complex for candidate epipoles near image 
points, and that this is more likely when: the true epipole is far from the point; 
and/or the point is isolated in the image; and/or the corresponding 3D depth is 
small; and/or the number of image points is small. Our experiments show that 
the complexity near image points produces local minima, confirming [3] [18]. We 
pointed out that the previous arguments of [3] [18] do not explain the error’s 
complexity or local minima. 

For non-forward translation estimates, we gave a simple model of the error 
for planar scenes. For two special cases, we derived a new double bas-relief ambi- 
guity and proved the absence of local minima at large |e|. For nonplanar scenes, 
we simplified our approximations under various assumptions, including rough 
rotational symmetry of the image and a reasonably “generic” distribution of 3D 
depths. Our simplest approximation gives a good model of the least-squares er- 
ror in all our noiseless experiments. We analyzed the error for projective SFM, 
pointing out that it is flatter than the Euclidean error, and showed by experi- 
ments that our analysis remains useful for noisy images. 

We believe that our results will lead to an in-depth understanding of the
least-squares error. For example, our sideways asymptotic estimate depends
on just 29 parameters, and often 13 are enough. This suggests that a
semi-exhaustive search through the space of least-squares errors may be feasible to
determine the pitfalls that algorithms could encounter.
Abstract. We present a generative probabilistic model for 3D scenes with stereo
views. With this model, we track an object in 3 dimensions while simultaneously 
learning its appearance and the appearance of the background. By using a genera- 
tive model for the scene, we are able to aggregate evidence over time. In addition, 
the probabilistic model naturally handles sources of variability. 

For inference and learning in the model, we formulate an Expectation Maximiza- 
tion (EM) algorithm where Rao-Blackwellized Particle filtering is used in the E 
step. The use of stereo views of the scene is a strong source of disambiguating 
evidence and allows rapid convergence of the algorithm. The update equations 
have an appealing form and as a side result, we give a generative probabilistic 
interpretation for the Sum of Squared Differences (SSD) metric known from the 
field of Stereo Vision. 



1 Introduction 

We introduce a generative, top-down viewpoint for tracking and scene learning. We
assume that a scene is composed of a moving object in front of a background. The scene
model is shown in Figure 1(a). Within this paradigm, we can simultaneously learn the
appearance of the background and the object, while the object moves in three dimensions
within the scene.

The algorithm is based on a probabilistic generative modelling approach. Such a 
model describes the scene components and the process by which they generate the ob- 
served data. Being probabilistic, the model can naturally describe the different sources 
of variability in the data. This approach provides a framework for learning and tracking, 
via the EM algorithm associated with the generative model. In the E-step, object posi- 
tion is inferred and sufficient statistics are computed; in the M-step, model parameters, 
including object and background appearances, are updated. 

Sensor fusion is another important advantage of the probabilistic generative mod- 
elling approach. Whereas a bottom-up approach would process the signal from each 
camera separately, then combine them into an estimate of the object position, our ap- 
proach processes the camera signals jointly and in a systematic fashion that derives from 
the model. 

The use of a stereo view of the scene turns out to be of significant value over the 
use of a monocular view. It allows the algorithm to locate and track an object, even 
when the prior model of the object appearance is uninformative, e.g. when initialized to 
random values. As a consequence, only a small number of EM iterations are required 
for convergence. 

In section 2 we discuss prior work and relate it to the current work. In section 3, we 
introduce the scene model. When the object moves within the scene, the connectivity of 
the graphical model for the scene changes. The connectivity is dictated by the geometry 
of the scene and is captured by the coordinate transformations that are discussed in sec- 
tion 4. In section 5 we discuss the Generalized EM algorithm, emphasizing the intuitive 
interpretation of the update equations of the E-step. Section 5.1 discusses the combi- 
nation of EM and particle filtering for inferring the location and learning appearances. 
Results for a video sequence are given in section 6. 

2 Related Work 

The work presented here can be viewed as drawing on and bridging the fields of 3D 
tracking [1,2], stereo vision [3] and 2-D scene modelling [4]. We briefly review related 
work in these fields and relate and contrast it with the current work. 

Tracking an object in three dimensions is useful for a variety of applications [5,6] 
ranging from robot navigation to human computer interfaces. Most tracking methods 
rely on a model of the object to be tracked. Object models are usually constructed by 
hand [7,1,2]. For example, Schodl et al. [2] use a textured 3D polygonal model and use 
gradient descent on a cost function. 

Our model is similar to these methods in that we use an appearance map of the object, 
and track it in 3 dimensions. These methods rely on strong prior models in order to do 
tracking from a monocular view. As we use a stereo view of the scene, our method does 
not require prior hand construction of the model of the object, e.g. the face, and we are 
able to learn a model. Once a model has been learned, one can track the object using 
only a monocular view. 

The objective of most stereo vision work has been to extract a depth map for an 
image. The evidence is in the form of disparity between pixels in two or more views of 
the same scene [3,8,9]. Most stereo vision methods calculate a disparity cost based on 
this evidence, such as Sum of Squared Differences (SSD) [10]. In section 5.5 we offer a 
generative probabilistic interpretation for SSD. 

Frey and Jojic [4,11], and Dellaert et al. [12] use generative top-down models. 
They use layered 2D models, and learn 2D templates for objects that move across a 
background [13]. When using a monocular view from a single camera, learning the 
appearance of objects that can occlude each other is a hard problem. By incorporating 
stereo views of a scene, we can resolve the identifiability problem inherent with using 
a single camera and can more easily track an object in 3 dimensions. 

Recently, a great deal of attention has been paid to particle filtering in various guises. 
Blake et al.[14,15] use models based on tracking spline outlines of objects[16]. Other 
researchers have extended this to appearance based models [17]. As with the track- 
ing methods discussed before, the models are usually constructed by hand rather than 
learned. 

We use Rao-Blackwellized [18] particle filtering to track the position and orientation 
of an object within a scene. In Rao-Blackwellized particle filtering, the model contains 
random variables represented by parametric distributions as well as sampled random 
variables represented as particle sets. When performing inference over the sampled 
random variable, one must integrate over the parametric random variables. 

We extend the standard Particle Filtering[14] paradigm in two ways. First, we use 
particle filtering in conjunction with stereo observations to track an object in 3 dimen- 
sions. Secondly, unlike most tracking paradigms, we are also able to learn the appearance 
of the objects in the scene, as they move in the scene [19,18]. We believe this is the first 
demonstration of this algorithm for real data. 



3 The Stereo Scene Model 

The scene model is shown in Figure 1(a). The figure shows a background, a “cardboard 
cutout” object in front of the background that occludes part of it and two cameras. Figure 
1(b) shows the equivalent graphical model. We assume that the object will be seen at 
different locations in the two cameras due to stereo disparity and that the cameras are 
aligned such that the same background image is seen in both cameras. 




[Figure 1: (a) scene with left and right cameras, background and object; (b) graphical model with nodes for location/orientation, mask and appearance.] 

Fig. 1. (a) Schematic of scene with stereo cameras. (b) Generative model for stereo observations 
of a scene with a single object that partially occludes a background. 



In this graph, $V^0$ is the background image, $V^1$ is the object, $O^1$ is the transparency 
mask of the object, $x$ is a vector containing the position and orientation of the object, 
$Y^l$ is the observed image in the left camera and $Y^r$ is the observed image in the right 
camera. 

The position variable $x$ is a continuous random variable which contains at least 3 
spatial coordinates of the object, allowing for 3D translation within the scene. 

We use multivariate Gaussians with diagonal covariance matrices to model all 
appearances. Hence, the appearance model of the background is 

$$p(V^0) = \prod_j \mathcal{N}(v^0_j;\, \mu^0_j,\, \eta^0_j) \tag{1}$$

where $v^0_j$ is the value of pixel $j$, $\mu^0_j$ is the mean, and $\eta^0_j$ is the precision. 

The model for the object contains three components: a template, a transparency 
mask and a position. Again, the appearance is modelled by a multivariate Gaussian with 
diagonal covariance matrix. 



$$p(V^1) = \prod_i \mathcal{N}(v^1_i;\, \mu^1_i,\, \eta^1_i) \tag{2}$$

where $v^1_i$ is the value of pixel $i$, $\mu^1_i$ is the mean, and $\eta^1_i$ is the precision. 

Pixels in the object model can be opaque or transparent. We use discrete mixing, i.e. 
a pixel is either completely opaque or transparent. The prior distribution is 

$$p(O^1) = \prod_i \left[ \alpha^1_i o^1_i + (1 - \alpha^1_i)(1 - o^1_i) \right] \tag{3}$$

where $o^1_i$ is the value of pixel $i$ and $\alpha^1_i$ is the probability that the pixel is opaque. 

The distribution for the position/orientation random variable $x$ is handled differently 
from other variables in the model. It is represented by a particle set. A particle set is a 
set of vectors $\{x_s\}$ where each vector (also called a particle) represents a position of the 
object and each particle is associated with a weight $q(x_s)$. 

We use a Gaussian for the prior for the position of the object 

$$p(x) = \mathcal{N}(x;\, \mu_x,\, \eta_x) \tag{4}$$

where $\mu_x$ is the mean and $\eta_x$ is the precision. This is used when generating the initial 
set of particles and for recovering particles that land outside a bounding volume. 

When generating instances of the left and right camera images, we first sample 
from the background model, then we choose a position for the object and sample from 
the object appearance model. The appearance of the object is then overlaid on the 
background, for pixels where the object is opaque. For example, the value of the j-th 
pixel $y^l_j$ in the left image is 

$$y^l_j = o^1_i v^1_i + (1 - o^1_i)\, v^0_j, \qquad i = (\xi^l)^{-1}(x, j) \tag{5}$$

In words, pixel $y^l_j$ takes the value of the object pixel if it is opaque (i.e. $o^1_i = 1$) 
or the value of the background $v^0_j$ if it is transparent. Finally we add Gaussian pixel noise 
with precision $\lambda^l$. Pixels in the right image are of course found similarly, with noise 
precision $\lambda^r$. The function $(\xi^l)^{-1}(x, j)$ maps coordinates depending on the position of 
the object, and will be discussed in the next section. If we assume all variances are zero, 
the process of generating from this model is analogous to rendering the scene using 
standard computer graphics methods. 

Given the scene variables, the distribution for a pixel $y^l_j$ in the left image is 

$$p(y^l_j \mid V^0, V^1, O^1, x) = \begin{cases} \mathcal{N}(y^l_j;\, v^1_i,\, \lambda^l) & \text{if } o^1_i = 1 \\ \mathcal{N}(y^l_j;\, v^0_j,\, \lambda^l) & \text{if } o^1_i = 0 \end{cases} \tag{6}$$

The complete probability distribution for the sensor images is the product of the 
distributions for the individual pixels, 

$$p(Y^l, Y^r \mid V^0, V^1, O^1, x) = \prod_j p(y^l_j \mid V^0, V^1, O^1, x) \prod_j p(y^r_j \mid V^0, V^1, O^1, x) \tag{7}$$
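As an illustration of the generative process described in this section, the following Python sketch samples a single scanline of a stereo pair. It is a minimal sketch under stated assumptions, not the implementation used in the paper: the function names `sample_scene` and `render_row`, the integer `offset` that stands in for the coordinate mapping, and all numeric values are hypothetical.

```python
import random

def sample_scene(mu0, eta0, mu1, eta1, alpha1, rng):
    """Sample background, object appearance and mask (precisions = 1/variance)."""
    v0 = [rng.gauss(m, eta ** -0.5) for m, eta in zip(mu0, eta0)]
    v1 = [rng.gauss(m, eta ** -0.5) for m, eta in zip(mu1, eta1)]
    o1 = [1 if rng.random() < a else 0 for a in alpha1]
    return v0, v1, o1

def render_row(v0, v1, o1, offset, lam, rng):
    """Overlay the object on the background and add sensor noise.

    offset is a hypothetical integer shift standing in for the coordinate
    mapping; lam is the precision of the pixel noise.
    """
    row = []
    for j in range(len(v0)):
        i = j - offset  # object pixel that lands on sensor pixel j
        if 0 <= i < len(v1) and o1[i] == 1:
            clean = v1[i]  # opaque object pixel hides the background
        else:
            clean = v0[j]  # background shows through
        row.append(rng.gauss(clean, lam ** -0.5))
    return row

rng = random.Random(0)
# Assumed toy scene: dark background, bright fully opaque 3-pixel object.
scene = sample_scene([0.0] * 8, [1e6] * 8, [1.0] * 3, [1e6] * 3, [1.0] * 3, rng)
# Stereo disparity: the same object appears at a different offset in each view.
left = render_row(*scene, offset=3, lam=1e6, rng=rng)
right = render_row(*scene, offset=1, lam=1e6, rng=rng)
```

Both rows are rendered from the same sampled scene, so the background agrees between the two views while the object is shifted, which is the source of the stereo evidence used later.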






4 Coordinate Transformations 

The object can be at various locations and orientations. Hence, the mapping from coor- 
dinates on the object model to the image sensor will change. If the object is close to the 
camera, then each pixel on the object may map onto many pixels on the camera sensor, 
and if it is far away, many pixels map onto a single pixel in the camera sensor. 

We define a set of functions that map between coordinates in the various appearance 
models we will be using. We assume that the cameras are pinhole cameras, looking 
along the negative z axis. For example, if the distance between the two cameras is 10 
cm, then the left camera is located at $[-5, 0, 0]^T$ and the right camera is at $[5, 0, 0]^T$. The 
mapping is defined in terms of transformations of homogeneous coordinates. Homogeneous 
coordinates allow us to perform translations and perspective projections in a consistent 
framework and are commonly used in computer graphics. A point in homogeneous 
coordinates includes a 4th component $h$, i.e. $(x, y, z, h)$. Assuming a flat object $V^1$, the 
transformation from the matrix indices of the object into the matrix indices of the left 
sensor $Y^l$ is denoted as $j_l = \xi^l(x, i)$. This mapping is defined as 

$$\begin{pmatrix} \mathrm{indx}_i(Y^l, j_l) \\ \mathrm{indx}_j(Y^l, j_l) \\ 0 \\ 1 \end{pmatrix} = \mathrm{SM} \cdot \mathrm{PRS}(x) \cdot \mathrm{EYE}(l) \cdot \mathrm{W}(x) \cdot \mathrm{MO} \cdot \begin{pmatrix} \mathrm{indx}_i(V^1, i) \\ \mathrm{indx}_j(V^1, i) \\ 0 \\ 1 \end{pmatrix} \tag{8}$$

where $\mathrm{indx}_i(V^1, i)$ denotes the row index of pixel $i$ in the object and $\mathrm{indx}_j(V^1, i)$ 
denotes the column index. Similarly, $\mathrm{indx}_i(Y^l, j_l)$ denotes the row index of pixel $j_l$ in 
the left sensor image, and $\mathrm{indx}_j(Y^l, j_l)$ denotes the column index. MO transforms from 
matrix coordinates to canonical position in physical coordinates, W(x) transforms from 
canonical object position to the actual position $x$ of the object in physical coordinates 
(relative to the camera coordinate system). EYE(l) is the transformation due to the 
position of the left camera. In our case, it is simply a shift of 5 along x for the left camera, 
and -5 for the right camera. PRS(x) is the perspective projective transformation, which 
depends on the distance of the object from the camera. SM maps from physical sensor 
coordinates to sensor matrix coordinates. 

To transform an observed image into the object, we map the matrix indices of the 
object through this transformation, round the result to the nearest integer, and then 
retrieve the values of the image matrix at those indices. 

We will have a need for additional coordinate transformations: the inverse mapping 
of Eqn. (8) is $i = (\xi^l)^{-1}(x, j_l)$, which maps left sensor coordinates into object model 
coordinates. The function $j_r = \xi^r(x, i)$ maps from the coordinates of the object 
model into the right sensor matrix, and the inverse transformation is $i = (\xi^r)^{-1}(x, j_r)$. 

An interesting consequence of using stereo cameras and working in world coordinates 
is that coordinates have a physical meaning. For example, the matrix MO defines the 
physical resolution of the object appearance model. In our experiments, the physical size 
of one pixel on the surface of the object is about 1 cm x 1 cm. If only a single camera is 
used, it is not possible to determine the scale at which an object should be modelled. 
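The chain of transformations in Eqn. (8) can be sketched with plain 4x4 matrices. The concrete matrix entries below are assumptions made only for illustration (a fronto-parallel object under pure translation, 1 cm object pixels, a 48x64 sensor, and hypothetical camera constants); the paper does not specify them, and `object_to_sensor` is a hypothetical helper name.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def apply(m, v):
    """Apply a 4x4 matrix to a homogeneous 4-vector."""
    return [sum(m[r][k] * v[k] for k in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def object_to_sensor(row, col, pos, eye_shift):
    """Map object-model indices to rounded sensor indices (illustrative)."""
    size, r0, c0 = 1.0, 10, 10      # assumed: 1 cm object pixels, centre (10, 10)
    focal, px_per_cm = 1.0, 50.0    # assumed camera constants
    rc, cc = 24, 32                 # assumed centre of a 48x64 sensor
    depth = -pos[2]                 # cameras look along the negative z axis
    # MO: (row, col) indices -> physical (x, y) in canonical object position.
    MO = [[0, size, 0, -size * c0], [size, 0, 0, -size * r0],
          [0, 0, 1, 0], [0, 0, 0, 1]]
    W = translation(*pos)               # canonical -> actual object position
    EYE = translation(eye_shift, 0, 0)  # +5 for the left camera, -5 for the right
    s = focal / depth                   # flat object: one depth for all pixels
    PRS = [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
    # SM: physical sensor (x, y) -> (row, col) sensor indices.
    SM = [[0, px_per_cm, 0, rc], [px_per_cm, 0, 0, cc],
          [0, 0, 1, 0], [0, 0, 0, 1]]
    total = SM
    for m in (PRS, EYE, W, MO):         # rightmost matrix is applied first
        total = matmul(total, m)
    out = apply(total, [row, col, 0, 1])
    return round(out[0]), round(out[1])

position = (1.0, 0.0, -100.0)  # assumed object position in cm
left = object_to_sensor(10, 10, position, eye_shift=5.0)
right = object_to_sensor(10, 10, position, eye_shift=-5.0)
```

With these assumed constants the same object pixel lands on the same sensor row in both views but on different columns; the column difference is the stereo disparity, which shrinks as the object moves further from the cameras.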



5 EM-PF Algorithm for Learning Stereo Scenes 

Now we present an EM algorithm that employs Rao-Blackwellized particle filtering to 
compute approximations to the model posteriors in an approximate E-step. 

We employ two types of approximations to compute the model posteriors in the E 
step of the algorithm. The first approximation comes from the factorization of the graph, 
and the second from the approximation of the location posterior with a particle set. 



5.1 The EM - Particle Filtering Hybrid Algorithm 

The graph in Figure 1(b) hides the fact that the connectivity of the graph changes 
depending on the position of the object. Each pixel in the object $V^1$ can be connected 
to any pixel in $Y$, depending on the position $x$. Another way of viewing this is that 
every pixel in the object connects to every pixel in the image, and the position of the 
object determines which edges are “turned on”. Thus the graph is hugely loopy. Once a 
position has been chosen, the connectivity of the graph is dramatically reduced¹. 



Algorithm 1 EM - Particle filtering hybrid algorithm 

Initialize model parameters $\mu^0, \eta^0, \mu^1, \eta^1, \alpha^1$ 

for nGEM = 1 to num_GEM_iterations do 

    Approximate E step 

    Sample particle set $\{x\}_0$ from location prior $p(x)$. 
    for f = 1 to num_frames do 

        $\{x\}'_f \leftarrow \mathrm{sample}(p(x_{s,f} \mid x_{s,f-1}))$ - send particles through dynamic distribution 
        Estimate parameters of the approximate posteriors of the appearance and mask variables 
        Calculate particle weights $q(x_s)$ 
        $\{x\}_f \leftarrow \mathrm{resample}(\{x\}'_f, \{q(x_s)\})$ - re-sample particles based on weights 

    end for 

    M step 

    Update model parameters $\mu^0, \eta^0, \mu^1, \eta^1, \alpha^1$ 

end for 



¹ The graph still has “horizontal” chains, which we will discuss in the next section. 

It is problematic to use a parametric distribution for the position variable x since we 
need to integrate over it, which entails integrating over discrete topologies of the graph. 
This is the motivation for representing the location variable with a particle set, and using 
particle filtering for inferring posteriors for the location variable x. Algorithm 1 shows 
the hybrid EM - particle filtering algorithm for stereo scene analysis. 

When learning, we start by sampling from a location prior, and initializing the param- 
eters of the background and object models to random values. In the E step, we compute 
posterior distributions for the appearance models, and weights for each location particle. 
When going to the next frame, we re-sample the particles based on those weights and 
the particles are then passed through a dynamic distribution. The M step is performed 
after going through the whole sequence of frames. 
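
The propagate, weight, resample cycle of the approximate E step can be sketched as follows. This is only an illustrative skeleton under stated assumptions: it tracks a scalar position, uses a random-walk dynamic distribution, and takes the particle log-weight as a user-supplied function `log_weight`. The estimation of the appearance posteriors and the M step are omitted, and the names `e_step_tracking` and `toy_log_weight` are hypothetical.

```python
import math
import random

def e_step_tracking(frames, log_weight, prior_sample, dyn_std, n, rng):
    """Run one pass over the frames and return the best particle per frame."""
    # Sample the initial particle set from the location prior.
    particles = [prior_sample(rng) for _ in range(n)]
    modes = []
    for frame in frames:
        # Send particles through the (assumed random-walk) dynamic distribution.
        particles = [p + rng.gauss(0.0, dyn_std) for p in particles]
        # Weight each particle; subtract the maximum for numerical stability.
        logw = [log_weight(p, frame) for p in particles]
        top = max(logw)
        weights = [math.exp(lw - top) for lw in logw]
        modes.append(particles[logw.index(top)])
        # Re-sample particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n)
    return modes

def toy_log_weight(position, observation):
    """Hypothetical stand-in for the Rao-Blackwellized particle weight."""
    return -0.5 * ((position - observation) / 0.1) ** 2

rng = random.Random(1)
observations = [0.0, 1.0, 2.0]  # toy 1-D "frames"
modes = e_step_tracking(observations, toy_log_weight,
                        lambda r: r.gauss(0.0, 2.0), 1.0, 500, rng)
```

In the full algorithm the log-weight would come from integrating out the appearance and mask variables for each hypothesized position, rather than from a fixed function of the observation.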

Various extensions of the basic particle filtering algorithm are possible, e.g. using 
proposal distributions [15] or iterative updates within each frame [16], to get a more 
representative particle set for the location. 





[Figure 2: (a) portion of epipolar graph, with unobserved nodes marked; (b) geometric interpretation.] 

Fig. 2. (a) Portion of graphical model corresponding to an epipolar line. Notice that once the 
observation nodes have been set, the posterior of an object pixel is dependent on the 
directly observed pixels in the two sensor images and on further pixels along the epipolar 
line (and so on), which are not directly observed. The dependence comes from their influence 
on the shared background nodes. In Section 5.2 we describe how the chain is factored. 
(b) Geometric interpretation of graph in (a). 



5.2 Graph Factorization 

For a particular setting of the position variable x, the original graph factors into chains 
along the epipolar lines. In other words, the posterior distribution of a pixel in the object 
is not only dependent on the directly observed pixels it impinges on, but also depends 
indirectly on a large number of other pixels along the same epipolar line. Part of such a 
graph is shown in Figure 2. In order to make inference efficient we would like to factor 
the model and omit the dependence on pixels that are not directly observed². This can be 
accomplished by assuming that only the directly observed pixels in the camera sensors 
are observed and all other pixels are unobserved. This has the effect of decoupling the 
graph and leads to an approximation for the true posterior. From the perspective of 
inference, assuming that neighboring pixels are unobserved is equivalent to allowing 
those pixels to take on any values, including the values actually observed. 



5.3 Posterior Distributions of E-Step 

We now turn our attention to the posterior distributions for the object model and the 
position x. These distributions are required in the E step of the learning algorithm. We 
omit discussion of the posterior distributions for the background and mask due to space 
constraints as they are intuitively analogous. 

The manifestation of stereo in the equations below is one of the more important and 
pleasing results of this paper. Terms that can be interpreted as “appearance” terms as 
well as “stereo” terms fall out naturally from the generative model without any ad-hoc 
combination of these concepts. 



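Posterior for Object $V^1$. By assuming only the directly observable pixels in the sensors 
are observed, the posterior associated with pixel $i$ in the object becomes 

$$p(v^1_i, o^1_i, v^0_{j_l}, v^0_{j_r} \mid x, y^l_{j_l}, y^r_{j_r}) \tag{9}$$

$$\propto p(v^1_i, o^1_i, v^0_{j_l}, v^0_{j_r}, x, y^l_{j_l}, y^r_{j_r}) \tag{10}$$

$$= \begin{cases} p(o^1_i = 1)\, p(y^l_{j_l} \mid v^1_i, x)\, p(y^r_{j_r} \mid v^1_i, x)\, p(v^0_{j_l})\, p(v^0_{j_r})\, p(v^1_i)\, p(x) & \text{if } o^1_i = 1 \\ p(o^1_i = 0)\, p(y^l_{j_l} \mid v^0_{j_l}, x)\, p(y^r_{j_r} \mid v^0_{j_r}, x)\, p(v^0_{j_l})\, p(v^0_{j_r})\, p(v^1_i)\, p(x) & \text{if } o^1_i = 0 \end{cases} \tag{11}$$

To get the posteriors over the pixels of the object, we marginalize out $o^1_i$, $v^0_{j_l}$ and 
$v^0_{j_r}$. The posterior for $v^1_i$, given a location and the sensor images, is a mixture of two 
Gaussians 

$$p(v^1_i \mid x, y^l_{j_l}, y^r_{j_r}) = c\, \alpha^1_i\, w_1\, \mathcal{N}(v^1_i;\, \mu_{\text{observed}},\, \eta_{\text{observed}}) \tag{12}$$

$$\qquad +\; c\, (1 - \alpha^1_i)\, w_0\, \mathcal{N}(v^1_i;\, \mu_{\text{not observed}},\, \eta_{\text{not observed}}) \tag{13}$$

where $c$ is a normalizing constant, $\alpha^1_i$ is the prior for the mask variable, and $w_1$ and $w_0$ 
are the mixture weights. 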

² We also experimented with variational inference. Using variational inference, we were unable 
to learn the parameters of the occlusion variables. We believe this is due to the omission of 
important dependence structure, which the mean field approximation ignores. 

This is a very intuitive result. The first mixture is for the case that the mask is opaque 
for that pixel, and the second mixture is for the case that it is transparent. The mode of 
the “opaque” component is 

$$\mu_{\text{observed}} = \frac{1}{\eta^1_i + \lambda^l + \lambda^r} \left[ \eta^1_i \mu^1_i + \lambda^l y^l_{j_l} + \lambda^r y^r_{j_r} \right] \tag{14}$$



which is a weighted average of what is observed and the prior mode $\mu^1_i$. The weight $w_1$ 
for this component is composed of two Gaussian factors 

$$w_1 = \mathcal{N}\!\left(y^l_{j_l} - y^r_{j_r};\; 0,\; \frac{\lambda^l \lambda^r}{\lambda^l + \lambda^r}\right) \cdot \mathcal{N}\!\left(\frac{1}{\lambda^l + \lambda^r}\left[\lambda^l y^l_{j_l} + \lambda^r y^r_{j_r}\right];\; \mu^1_i,\; \frac{(\lambda^l + \lambda^r)\, \eta^1_i}{\lambda^l + \lambda^r + \eta^1_i}\right) \tag{15}$$

The first factor is the “stereo” factor, which is maximized when there is a close 
correspondence between what is seen in the left and right images, i.e. $y^l_{j_l} = y^r_{j_r}$, and the 
second factor, the “appearance” factor, is maximized when the prior for the object appearance 
matches the (weighted) mean observation. Hence the weight will be large for cases 
when there is good stereo correspondence and the observation matches the prior. 

The second component in the posterior in Eqn. (12) is for the case when the mask is 
transparent. In this case the mixture component is just equal to the prior. The weight $w_0$ 
for this component contains two factors that can be thought of as measuring the evidence 
that the observed pixel came from the background. 

$$w_0 = \mathcal{N}\!\left(y^l_{j_l};\; \mu^0_{j_l},\; \frac{\eta^0_{j_l} \lambda^l}{\eta^0_{j_l} + \lambda^l}\right) \cdot \mathcal{N}\!\left(y^r_{j_r};\; \mu^0_{j_r},\; \frac{\eta^0_{j_r} \lambda^r}{\eta^0_{j_r} + \lambda^r}\right) \tag{16}$$

The first term is maximized when the observation matches the left background pixel, and 
the second term is maximized when the right background pixel matches the observed pixel in 
the right camera. 
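
The quantities in Eqns. (14) to (16) are simple to evaluate for a single object pixel. The sketch below assumes the Gaussian marginals implied by the model, with every Gaussian parameterised by its precision; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def normal_pdf(y, mean, precision):
    """Gaussian density parameterised by precision (inverse variance)."""
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(
        -0.5 * precision * (y - mean) ** 2)

def opaque_terms(yl, yr, mu1, eta1, lam_l, lam_r):
    """Return (mu_observed, w1) for the opaque case, Eqns. (14) and (15)."""
    lam = lam_l + lam_r
    mu_obs = (eta1 * mu1 + lam_l * yl + lam_r * yr) / (eta1 + lam)
    # "Stereo" factor: left and right observations should agree.
    stereo = normal_pdf(yl - yr, 0.0, lam_l * lam_r / lam)
    # "Appearance" factor: weighted mean observation should match the prior.
    ybar = (lam_l * yl + lam_r * yr) / lam
    appearance = normal_pdf(ybar, mu1, lam * eta1 / (lam + eta1))
    return mu_obs, stereo * appearance

def transparent_weight(yl, yr, mu0_l, eta0_l, mu0_r, eta0_r, lam_l, lam_r):
    """Return w0 for the transparent case, Eqn. (16)."""
    left = normal_pdf(yl, mu0_l, eta0_l * lam_l / (eta0_l + lam_l))
    right = normal_pdf(yr, mu0_r, eta0_r * lam_r / (eta0_r + lam_r))
    return left * right

def opaque_responsibility(alpha, w1, w0):
    """Posterior probability that the mask pixel is opaque."""
    return alpha * w1 / (alpha * w1 + (1.0 - alpha) * w0)

# Assumed toy values: both cameras see 1.0, the object prior mean is 1.0
# and the background prior mean is 0.0.
mu_obs, w1 = opaque_terms(1.0, 1.0, 1.0, 4.0, 10.0, 10.0)
w0 = transparent_weight(1.0, 1.0, 0.0, 4.0, 0.0, 4.0, 10.0, 10.0)
resp = opaque_responsibility(0.5, w1, w0)
```

In this toy setting the observations agree with each other and with the object prior, so the opaque component dominates the mixture.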

Notice that Equation (12) is for a particular position of the object. The approximate 
posterior for the object appearance can now be written as a Gaussian mixture 
model with a large number of mixtures. In fact it will have $2 \cdot nsamp$ mixtures, where 
$nsamp$ is the number of particles in $\{x_s\}$. The weight of each mixture is the particle 
weight $q(x_s)$. Hence, the posterior of the object appearance is 

$$q(v^1_i \mid y^l_{j_l}, y^r_{j_r}) = \sum_{x_s} q(x_s)\, p(v^1_i \mid x_s, y^l_{j_l}, y^r_{j_r}) \tag{17}$$



5.4 Posterior for x 



The posterior for the position variable $x$ is represented by the particle set $\{x_s\}$ and 
associated weights $\{q(x_s)\}$. The posterior distribution for the position $x$ can be approximated 
at the position of the particles $x_s$ as 

$$p(x_s \mid Y^l, Y^r) \approx q(x_s) = \frac{p(x_s, Y^l, Y^r)}{\sum_k p(x_k, Y^l, Y^r)} \tag{18}$$



To arrive at an expression for the weight of a particle, we need to integrate over all 
parametric distributions (Rao-Blackwellization). By doing so, $p(x_s, Y^l, Y^r)$ can be 
shown to be 

$$p(x_s, Y^l, Y^r) = \prod_i \int p(x_s, Y^l, Y^r, v^1_i, V^0, O^1)\, dv^1_i\, dV^0\, dO^1$$

$$= \prod_i \left[ \alpha^1_i\, w_1(i) + (1 - \alpha^1_i)\, w_0(i) \right] \tag{19}$$

where $\alpha^1_i$, $w_0(i)$ and $w_1(i)$ were defined above. 
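
In practice the product over pixels in Eqn. (19) is best accumulated in the log domain before the normalization of Eqn. (18), since a product over thousands of pixels underflows quickly. The following is a minimal sketch of that bookkeeping; the function names are hypothetical and the per-pixel weights are assumed to have been computed already.

```python
import math

def log_joint(alpha, w1, w0):
    """log p(x_s, Y^l, Y^r) as a sum over object pixels, following Eqn. (19)."""
    return sum(math.log(a * u + (1.0 - a) * v)
               for a, u, v in zip(alpha, w1, w0))

def normalise(log_joints):
    """Particle weights q(x_s) of Eqn. (18), computed stably."""
    top = max(log_joints)  # subtract the maximum before exponentiating
    unnorm = [math.exp(lj - top) for lj in log_joints]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Two hypothetical particles, each covering two object pixels.
alpha = [0.5, 0.5]
q = normalise([log_joint(alpha, [2.0, 2.0], [1.0, 1.0]),
               log_joint(alpha, [1.0, 1.0], [1.0, 1.0])])
```

The first particle explains both pixels better through the opaque component, so it receives the larger weight.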



5.5 Generative Probabilistic Interpretation of SSD 

The Sum of Squared Differences (SSD) metric is commonly used in stereo vision [3,10] 
to measure how well a patch in one image matches a patch from another image, as 
a function of disparity. It is interesting to note that SSD can be seen as a component or 
special case of Equation (18). 

Equation (18) gives the posterior distribution $p(x_s \mid Y^l, Y^r)$ for the location of the 
object and can be interpreted as measuring the “fit” of the hypothesized position to the 
observed data. Recall that Equation (18) contains both “appearance” related terms and 
“stereo” related terms. 

To see the relationship of Equation (18) to the SSD metric, we assume that the 
appearance model is completely uninformative ($\eta^1 = 0$), that the object is completely 
opaque ($\alpha^1_i = 1$ for all $i$), and take the log to arrive at the form 

$$\log p(x \mid Y^l, Y^r) \propto \log \prod_i \left[ \alpha^1_i\, w_1(i) + (1 - \alpha^1_i)\, w_0(i) \right] = \sum_i \log w_1(i). \tag{20}$$

Recall that the first term in the weight $w_1$ is $\mathcal{N}\!\left(y^l_{j_l} - y^r_{j_r};\, 0,\, \frac{\lambda^l \lambda^r}{\lambda^l + \lambda^r}\right)$. Hence, for 
this special case 

$$\log p(x \mid Y^l, Y^r) \propto -\sum_i \left( y^l_{\xi^l(x,i)} - y^r_{\xi^r(x,i)} \right)^2 \tag{21}$$

which is exactly equivalent to the SSD over the whole image. 
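
The step from Eqn. (20) to Eqn. (21) can be written out explicitly. The following is a sketch of the omitted algebra; the shorthand $\kappa$ and the constant $C$ are introduced here for convenience and are not part of the original derivation, and $y^l_{j_l}$, $y^r_{j_r}$ denote the left and right sensor pixels onto which object pixel $i$ maps for the hypothesized position. In the limit of an uninformative appearance model the appearance factor of $w_1$ becomes flat, so it contributes the same value at every hypothesized position and is absorbed into $C$.

```latex
\kappa = \frac{\lambda^l \lambda^r}{\lambda^l + \lambda^r}, \qquad
\log \mathcal{N}(d;\, 0,\, \kappa) = \tfrac{1}{2} \log \frac{\kappa}{2\pi} - \frac{\kappa}{2}\, d^2

\sum_i \log w_1(i) = C - \frac{\kappa}{2} \sum_i \left( y^l_{j_l} - y^r_{j_r} \right)^2
```

Maximizing the log posterior over the position is therefore the same as minimizing the sum of squared differences between corresponding left and right pixels, with the noise precisions only setting the scale.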



[Figure 3: left-camera frames 1, 3, 5, 7 and 9 of the training sequence.] 

Fig. 3. Training data consists of a short sequence of 10 stereo video frames. The frames were 
downsampled to 64x48 pixels. The figure shows the frames from the left camera. Notice that the 
person approaches the camera from the right and then recedes to the left. The trajectory is shown 
in Figure 6. 

6 Experiments 

A short video was recorded using a stereo video camera. The frame rate was 2 frames 
per second. A subset of the 10 frame sequence used to train the model is shown in Figure 
3. Notice that the person approaches the camera from the right and then recedes to 
the left. 

Figures 4 and 5 show the models that were found as a result of running the algorithm 
on the 10 frames shown in Figure 3. In these experiments, 500 particles were used. As can 
be seen in Figure 5, the background image is learned precisely in most areas. However, 
in areas where the background is never seen, the background has not been learned, and 
the variance is high. 



[Figure 4: four panels showing the mean, the variance, the transparency, and an object generated from the model.] 

Fig. 4. Model learned for object. The model is comprised of a Gaussian appearance model $V^1$ 
and a discrete transparency model $O^1$. The leftmost figure shows the mean of $V^1$, the second 
figure shows the variance of $V^1$ (notice higher variance on the forehead). The third plot shows 
the probability of each pixel being opaque. The rightmost plot shows an object generated from 
the model. Notice that patches of the background where there is no detail have been associated 
with the object. 



[Figure 5: two panels showing the means and the variances of $V^0$.] 

Fig. 5. Model learned for background. The model $V^0$ is a multivariate Gaussian. The left image 
shows the means, and the right image shows the variances of $V^0$. Notice that the areas where the 
background was never observed remain the color of the object and have high variance. 

The transparency mask has been learned well, except in areas where there is no 
texture in the background which would allow the model to disambiguate these pixels. 
Notice that the appearance model has been learned quite well. As can be seen in Figure 
3, highlights and specularity of the forehead, nose and shirt vary between frames. The 
consequence of this is the large variance in these areas. A second factor that introduces 
variance is that the model assumes the object is flat. Hence, there will be distortion due 
to the different perspectives of the two cameras. The model allows for this discrepancy 
by assigning larger variance to the object appearance model along the edges of the face. 
A third source of variability comes from the inference algorithm itself. The sampling 
resolution can be too coarse, which prevents the algorithm from accurately finding the 
mode of the location posterior. This does not seem to be a problem here. This effect can 
be reduced in a number of ways, including increasing the number of particles and using 
higher order dynamics in the temporal distribution. 







[Figure 6: trajectory plot; horizontal axis: X position.] 

Fig. 6. The mode of the location distribution in iteration 9 of the EM algorithm. The units are 
approximately centimeters. Notice that there is considerable variation in both depth and horizontal 
location. 



Figure 6 shows the trajectory of the mode of the distribution for the location variable 
x. The figure clearly shows a right to left trajectory of the person, who starts in the right 
hand side of the frame, moves closer and to the center and then recedes to the left. 

7 Discussion 

The algorithm requires a large number of coordinate transformations as well as 
evaluations of posteriors for the transformed images. The complexity of the algorithm is 
$O((m + n) \cdot it \cdot nsamp \cdot fr)$ where $m$ is the number of pixels in the background, $n$ is 
the number of pixels in the object model, $it$ is the number of iterations of GEM, $nsamp$ 
is the number of samples and $fr$ is the number of frames³. 

The transformations required for inference and learning resemble those used in computer 
graphics. Commodity 3-D graphics accelerators are capable of performing the 
required computations at high speeds and we anticipate that a fast implementation can 
be achieved. 

³ Each frame takes about 15 seconds on a 2.8GHz Pentium running Matlab code. 

In some cases, it is a poor assumption that the background is at a relatively large 
distance and can be modelled as a planar surface. This can happen when there are 
stationary objects in the scene at a similar distance as the object we wish to model and 
track. For this case it may be advantageous to use separate background models for the 
two cameras. 

Particle Filtering and other Markov Chain Monte Carlo methods are considered slow 
techniques. In addition, when a generative top-down model is used, exact inference will 
theoretically require the search over a huge space of possible configurations of the hidden 
model. With a continuous location variable, this space is in fact infinite. Despite this, we 
are able to both track and learn the appearance of the objects in a scene. This is partly 
due to the advantageous prior structure imposed by the top-down model, partly due to 
the strong disambiguating information provided by stereo views of the scene and partly 
due to an inference algorithm that is able to search only over the regions of the hidden 
variable space that are likely to contain the best explanation for the visible scene. 

Stereo information allows the algorithm to latch on to the correct position of the object 
immediately, even when the appearance model is of no help, e.g. when it is initialized 
to random values. Hence, stereo information allows the algorithm to track and learn 
appearance without any prior knowledge of the appearance of an object. 

When we applied an equivalent monocular algorithm using a single camera to the 
above data, the algorithm did not track the object, did not learn the object model and 
consistently fell into local minima. However, once an appearance model has been learned 
(using stereo) one can switch to using a single camera to track the object. 

A strength of generative probabilistic models is the consistent fusion of multiple 
types of information where noise and uncertainty are correctly taken into account. In 
the current paper, we fuse appearance, stereo views and views through time to learn a 
single underlying representation that explains a scene. Information from multiple frames 
is automatically used to fill in portions of the model that are only observed in a subset of 
frames. 

The framework uses a (simple) generative 3D model of a scene and we show that we 
can successfully perform inference in such a top-down model. In contrast, the majority of 
methods in computer vision are bottom-up methods. Aggregation of multiple sub models 
into larger models is a challenge for such approaches. Hence, we believe the extension of 
the current paradigm to be a very fruitful direction of further research, especially when 
it is desirable to construct consistent 3D representations of the world. 

Abstract. We introduce the isophotic metric, a new metric on surfaces, 
in which the length of a surface curve is not just dependent on the curve 
itself, but also on the variation of the surface normals along it. A weak 
variation of the normals brings the isophotic length of a curve close to 
its Euclidean length, whereas a strong normal variation increases the 
isophotic length. We actually have a whole family of metrics, with a 
parameter that controls the amount by which the normals influence the 
metric. We are interested here in surfaces with features such as smoothed 
edges, which are characterized by a significant deviation of the two prin- 
cipal curvatures. The isophotic metric is sensitive to those features: paths 
along features are close to geodesics in the isophotic metric, paths across 
features have high isophotic length. This shape effect makes the isophotic 
metric useful for a number of applications. We address feature sensitive 
image processing with mathematical morphology on surfaces, feature sen- 
sitive geometric design on surfaces, and feature sensitive local neighbor- 
hood definition and region growing as an aid in the segmentation process 
for reverse engineering of geometric objects. 



1 Introduction 

The original motivation for the present investigation comes from the automatic 
reconstruction of CAD models from measurement data of geometric objects. 
In this area, called reverse engineering of geometric objects, a variety of shape 
classification methods have been developed, which aim at a segmentation of the 
measurement data into regions of the same surface type [26]. Particularly for 
traditional geometric objects, where most of the surfaces on the boundary of 
the object are fundamental shapes, the surfaces are often separated by edges 
or smoothed edges, so-called blending surfaces. Thus, it is natural to look at 
geometric processing tools on surfaces which are sensitive to such features. 

Inspired by image processing, which frequently uses mathematical morphol- 
ogy for basic topological and geometric operations [9,22], we have been looking 
for similar operations on surfaces. However, we found that just a few contribu- 
tions [10,13,17,19,20,28] extend morphology to curved manifolds or meshes and 
cell decompositions on curved manifolds; none of these papers deals with the 
behavior at features. Thus, we will focus here on feature sensitive mathematical 
morphology on surfaces. We implement this through the use of adaptive structur- 
ing elements (SE), which change their shape and/or size based on either spatial 
position [2] or image content. The latter has been used, for example, in range 
image processing [27]. In these images, the pixel values represent distances to 
the detector, and hence they can be used to adapt the SE size to the expected 
feature size. To define appropriate SEs, we have developed an adapted metric on
a surface, which we call the isophotic metric. In this metric, the length of a surface
curve depends not only on the curve, but also on the surface normal field along 
it. SEs, which are geodesic discs in the isophotic metric, behave in the right way 
at features that are characterized by a significant deviation of the two principal 
curvatures. 

The isophotic metric also simplifies the definition of local neighborhoods 
for shape detection, the implementation of region growing algorithms and the 
processing of the responses from local shape detection filters (images on surfaces).
For example, the neighborhoods of a point shown in Fig. 2 are not equally useful 
for local shape detection: The neighborhood based on the Euclidean metric (left) 
flows across the feature. However, the other neighborhoods (middle and right) 
respect the feature and are more likely to belong to the same surface type in 
an engineering object. Another example is depicted in Fig. 6: Region growing 
based on a feature sensitive metric can easily be stopped at features. Yet another 
application is design on surfaces: geodesics in a feature sensitive metric nicely 
follow features (Fig. 3) and morphology in such a metric could be used for artistic 
effects which are in accordance with the geometry of the surface (Fig. 7). 

1.1 Previous Work 

Mathematical morphology provides a rich and beautiful mathematical theory as 
well as a frequently used toolbox for basic topological and geometric operations 
on images [9,22]. Almost all of the work in discrete morphology is in $\mathbb{R}^n$, where
the group of translations generates in a natural way the geometry of Minkowski 
sums. The latter are the basic building block for further powerful morphological 
operations. A few contributions go beyond this framework and in a direction 
which is close to our approach. As long as we are looking just for topologi- 
cal neighborhoods in meshes as discretizations of curved surfaces, we may use 
morphology on graphs [10,28]. The special case of 2D triangle meshes with the 
Delaunay property has been investigated by Lomenie et al. [13]. Topological 
neighborhoods on triangle meshes are also employed in a paper by Rossi et al. 
[20], which uses morphology for the extraction of feature lines on surfaces.

A basic problem in the extension of morphology to surfaces is the fact that
there are no really useful translations, since parallel transport known in differ- 
ential geometry is path dependent in case of non-vanishing Gaussian curvature. 
This problem has been addressed [19], but to our knowledge the studies have not 
been pursued towards efficient algorithms and practical applications. A simple 
geometric way to overcome the lack of translations is the use of approximants 
to geodesic circles as local neighborhoods (positions of the structuring element). 
The continuous viewpoint leads to morphology based on the distance function. 
There is beautiful work on this topic, mainly based on mathematical formula- 
tions with partial differential equations (PDE); see [1,4,21] and the references 
therein. The present paper is also related to geodesic active contours [3,21] in 
the sense that an appropriate Riemannian metric simplifies the formulation of 
the problem. 



1.2 Contributions of This Paper 

In our work we also use the PDE formulation; however, the metric and the 
resulting distance functions are more general. The distance functions we derive 
are based on the Gaussian mapping $\gamma$ from a surface to the unit sphere $S^2$,
a basic concept in differential geometry [6]. The main contributions of our work
are the following: 

— We define the isophotic metric, study its basic geometric properties, and 
discuss its analytical treatment for relevant surface representations (Sect. 2). 

— The governing equations of distance fields in the new metric are elaborated 
and efficiently solved in a numerical way (Sect. 3). 

— We introduce feature sensitive morphology on surfaces, which is based on the
new metric, and present applications in Computer Aided Design (Sect. 4).



2 The Isophotic Metric 

Let us consider a surface $\Phi \subset \mathbb{R}^3$. We assume that we have chosen, at least
locally, a continuous orientation of the unit normal vectors of $\Phi$; n(p) denotes
the unit normal vector at the point $p \in \Phi$. The Gaussian map $\gamma$ from
$\Phi$ to the unit sphere $S^2$ maps a surface point p to the point $n(p) \in S^2$ (see e.g.
[6]). The preimage $\gamma^{-1}(c)$ of a circle $c \subset S^2$ is a curve on $\Phi$, called an isophote. The
surface normals along an isophote form a constant angle with the rotational axis
of c. These curves of equal brightness in a very simple illumination model have
been studied in classical constructive geometry; more recently they have been
used in Computer Aided Design for quality inspection of surfaces [16]. We now
define the purely isophotic metric on a surface as follows: The isophotic length
of a curve c on the surface is the Euclidean length of its Gaussian image curve
$\gamma(c) \subset S^2$. This metric obviously has the following simple properties:

— The shortest distance between two surface points is the angle ($\in [0, \pi]$) be-
tween their surface normals.

Fig. 1. Isophotes on an elliptic paraboloid (left) and a hyperbolic paraboloid (right).



— Let us fix a point $m \in \Phi$. A geodesic circle in the isophotic metric, i.e. the
set of all points $p \in \Phi$ that lie at constant isophotic distance r to m, is an
isophote $c_r$. The Gaussian image of this isophote is a circle with center $\gamma(m)$
and spherical radius r.

— A geodesic g on $\Phi$ in the sense of the isophotic metric possesses as Gaussian
image a geodesic on the unit sphere, i.e. a great circle. Let $a_g$ denote the
rotational axis of this circle. Then, at each point p of g the surface normal
n(p) is orthogonal to $a_g$. Considering a parallel projection in direction of
$a_g$, the curve g is a silhouette (contour generator) on $\Phi$. These curves have
been extensively studied in constructive geometry and in Computer Vision
(see e.g. [5]).

Example 1: Consider the paraboloid

$\Gamma: 2z = \kappa_1 x^2 + \kappa_2 y^2$.    (1)

Let us compute the isophotic geodesic circles with center at the origin, i.e., the
isophotes for the direction e = (0, 0, 1). The direction of the normal at a surface
point is given by the vector $(\kappa_1 x, \kappa_2 y, -1)$. The angle $\alpha$ between the normal
n(p) at an arbitrary point $p \in \Gamma$ and e satisfies $\cos^2\alpha = 1/(\kappa_1^2 x^2 + \kappa_2^2 y^2 + 1)$.

Thus, the isophotes to $\cos^2\alpha = c^2 = \mathrm{const}$ are given by

$\kappa_1^2 x^2 + \kappa_2^2 y^2 = k^2$, with $k^2 := 1/c^2 - 1$.    (2)

In the xy-plane, these curves are concentric and similar ellipses. In $\mathbb{R}^3$, (2) de-
scribes elliptic cylinders which intersect the paraboloid $\Gamma$ in the actual isophotes
(see Fig. 1). Note that the ratio of axis lengths of the ellipses (2) is $\rho_1 : \rho_2$, where
$\rho_i := 1/|\kappa_i|$ are the principal curvature radii of $\Gamma$ at the origin. We have used
this example since it reveals important information for the general case as well.
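
As a numerical illustration of Example 1, the following minimal sketch samples
points on one of the ellipses (2) and evaluates the angle between the paraboloid
normal and e = (0, 0, 1). The curvature values, the constant k and all function
names are assumptions chosen for illustration; they do not come from the paper.

```python
import math

def unit_normal(k1, k2, x, y):
    """Unit normal of the paraboloid 2z = k1*x^2 + k2*y^2 above the point (x, y)."""
    nx, ny, nz = k1 * x, k2 * y, -1.0
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    return nx / norm, ny / norm, nz / norm

def isophote_points(k1, k2, k, samples=12):
    """Points (x, y) on the ellipse k1^2 x^2 + k2^2 y^2 = k^2 in the xy-plane."""
    angles = (2 * math.pi * i / samples for i in range(samples))
    return [(k * math.cos(t) / abs(k1), k * math.sin(t) / abs(k2)) for t in angles]

# Assumed example values: principal curvatures 2 and 0.5, isophote constant k = 0.3.
k1, k2, k = 2.0, 0.5, 0.3

# cos^2(alpha) is the squared z-component of the unit normal, since e = (0, 0, 1).
cos2 = [unit_normal(k1, k2, x, y)[2] ** 2 for x, y in isophote_points(k1, k2, k)]

# Ratio of the ellipse semi-axes; it should agree with rho_1 : rho_2.
axis_ratio = (k / abs(k1)) / (k / abs(k2))
```

All entries of `cos2` should agree with 1/(1 + k^2) up to rounding, and
`axis_ratio` should equal the ratio of the principal curvature radii.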
Fig. 2. Approximate geodesic circles on a triangle mesh in the Euclidean metric (left),
purely isophotic metric (middle) and isophotic metric (right).

We may approximate any regular surface $\Phi$ at an arbitrary point m up to
second order by a paraboloid $\Gamma(m)$. In a local Cartesian frame with origin at
m and with the principal curvature directions and the surface normal as coor-
dinate axis directions this paraboloid is written in the form (1). Here, $\kappa_i$ are the
principal curvatures of $\Phi$ and $\Gamma$ at m. Our example now describes the behavior
of ‘small’ isophotes around m. Viewing the family of shrinking isophotes for
$k \to 0$ as a curve evolution (cf. Fig. 1), we may say in usual terminology, that
this family shrinks to an elliptic point with axis ratio $\rho_1 : \rho_2$. If we magnify the
isophotes during the evolution so that they keep e.g. their length, the limit is
an ellipse in the tangent plane at m, whose axes agree with the principal axes
of the surface and whose axis ratio is $\rho_1 : \rho_2$. The discussion of this example
shows the following two important facts, the first of which is desirable but the
second one is not:

— Isophotic geodesic discs around a center m are interesting candidates for 
structuring elements in mathematical morphology on surfaces. They are 
elongated in direction of large normal curvature radii and they are of smaller 
width in direction of small normal curvature radii. This anisotropic behavior 
is useful if we are working along surface features which are characterized 
by a significant deviation between the two principal curvatures, e.g. along 
smoothed edges, blends and similar curve-like features. 

— At points with vanishing Gaussian curvature, $K = \kappa_1\kappa_2 = 0$, at least one
principal curvature $\kappa_i$ vanishes and the metric degenerates. In the example
of Fig. 2, the triangle mesh is close to a developable surface and thus the
isophotes in Fig. 2 (middle) are close to straight lines, namely the rulings on
the developable surface.

Keeping the first property and eliminating the second one has a simple solu-
tion: the purely isophotic metric is regularized with help of the Euclidean metric
on $\Phi$. More precisely, we define the regularized isophotic metric, henceforth often
briefly denoted as ‘isophotic metric’, via the arc length differential

$ds_i^2 = w\,ds^2 + w^*(ds^*)^2$,    (3)

where ds is the arc element on the surface and ds* is the arc element on its Gaus-
sian image; w > 0 and w* > 0 are the weights of the Euclidean and isophotic
components, respectively. In the simplest form, the weights will be chosen con-
stant. They can however also be dependent on some appropriate function defined
on the surface $\Phi$. The choice of the weights offers a further tool to design appro-
priate structuring elements for mathematical morphology on $\Phi$.

2.1 Computation on Parametric Surfaces

The computation of the isophotic metric uses only a few basic facts from dif-
ferential geometry. Let us consider a parameterized surface x(u, v) and a curve
c(t) on it, given by its preimage (u(t), v(t)) in the parameter plane. The first
derivative vector $\dot{c}$ of the curve c(t) = x(u(t), v(t)) satisfies

$\dot{c}^2 = \dot{c} \cdot \dot{c} = (\dot{u} x_u + \dot{v} x_v)^2 = g_{11}\dot{u}^2 + 2 g_{12}\dot{u}\dot{v} + g_{22}\dot{v}^2$.    (4)

Here $x_u$, $x_v$ are the first order partial derivatives of x; their inner products,

$g_{11} = x_u^2$, $g_{12} = x_u \cdot x_v$, $g_{22} = x_v^2$,    (5)

form the symmetric matrix $I = (g_{ik})$ of the first fundamental form. It allows us
to perform metric computations in the tangent spaces of the surface directly in
the parameter domain. For example, the computation of the total arc length of a
surface curve by means of its preimage u = (u(t), v(t)), $t \in [a, b]$, in the parameter
domain is done with

$s = \int_a^b \|\dot{c}\| \, dt = \int_a^b \sqrt{\dot{u}^T \cdot I \cdot \dot{u}} \; dt$.    (6)

The same can be done with the Gaussian image of the surface. Unit normals are
computed as

$n = \frac{x_u \times x_v}{\|x_u \times x_v\|} = \frac{x_u \times x_v}{\sqrt{g_{11} g_{22} - g_{12}^2}}$.

Thus, the first derivative of the image curve $c^*(t) = \gamma(c(t)) = n(u(t), v(t))$ on
the Gaussian sphere satisfies

$(\dot{c}^*)^2 = (\dot{u} n_u + \dot{v} n_v)^2 = l_{11}\dot{u}^2 + 2 l_{12}\dot{u}\dot{v} + l_{22}\dot{v}^2$.    (7)

Here, the inner products of the partial derivatives of the unit normal field,

$l_{11} = n_u^2$, $l_{12} = n_u \cdot n_v$, $l_{22} = n_v^2$,    (8)

form the symmetric matrix III of the so-called third fundamental form. This 
matrix, which is not regular at points with vanishing Gaussian curvature K, 
defines the purely isophotic metric on the surface in exactly the same way as 
the first fundamental matrix I describes the Euclidean metric on the surface. 
Finally we see that the regularized isophotic metric has the fundamental matrix 

$M = wI + w^*III = (w g_{ij} + w^* l_{ij})$.    (9)

With help of M, one introduces a Riemannian metric in the parameter domain of 
the surface, and one can use the familiar framework from differential geometry 
to perform computations. For example, the total arc length of a curve in the 
isophotic metric is given by (6) with M instead of I. Figure 3 shows several 
geodesic curves we have computed on a parametric surface using I and M.
Three pairs of input points are each connected with a Euclidean geodesic and 
a regularized isophotic geodesic. The latter metric forces the geodesic curves to 
follow the features of the surface.

Fig. 3. Geodesic curves on a parametric surface with features: (light colored) computed
in the Euclidean metric, i.e. w = 1 and w* = 0 in (9); (dark colored) computed in the
isophotic metric, for w = 1 and w* = 2 in (9).
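
The construction of M in (9) can be sketched numerically. The following minimal
example, which is an illustration and not the authors' implementation, estimates
I and III by central differences and approximates the arc length (6) with M in
place of I by a midpoint rule. The sphere parameterization, the step sizes and
all function names are assumptions introduced here.

```python
import math

def sub(a, b):
    return [p - q for p, q in zip(a, b)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]

def partials(f, u, v, h=1e-5):
    """Central-difference partial derivatives f_u, f_v of a vector-valued map f(u, v)."""
    fu = [d / (2 * h) for d in sub(f(u + h, v), f(u - h, v))]
    fv = [d / (2 * h) for d in sub(f(u, v + h), f(u, v - h))]
    return fu, fv

def normal(x, u, v):
    """Unit normal n = (x_u x x_v) / ||x_u x x_v||."""
    xu, xv = partials(x, u, v)
    n = cross(xu, xv)
    length = math.sqrt(dot(n, n))
    return [c / length for c in n]

def gram(a, b):
    """Symmetric 2x2 matrix of inner products, as in (5) and (8)."""
    return [[dot(a, a), dot(a, b)], [dot(a, b), dot(b, b)]]

def isophotic_matrix(x, u, v, w=1.0, w_star=1.0):
    """M = w*I + w_star*III, cf. (9)."""
    first = gram(*partials(x, u, v))
    third = gram(*partials(lambda s, t: normal(x, s, t), u, v))
    return [[w * first[i][j] + w_star * third[i][j] for j in range(2)] for i in range(2)]

def curve_length(x, curve, w, w_star, steps=400):
    """Midpoint-rule length of t -> curve(t) = (u, v), t in [0, 1], measured with M."""
    total = 0.0
    for i in range(steps):
        (u0, v0), (u1, v1) = curve(i / steps), curve((i + 1) / steps)
        du, dv = u1 - u0, v1 - v0
        M = isophotic_matrix(x, (u0 + u1) / 2, (v0 + v1) / 2, w, w_star)
        total += math.sqrt(M[0][0] * du * du + 2 * M[0][1] * du * dv + M[1][1] * dv * dv)
    return total

R = 2.0  # assumed sphere radius for the check

def sphere(u, v):
    return [R * math.cos(u) * math.cos(v), R * math.sin(u) * math.cos(v), R * math.sin(v)]

def quarter_equator(t):
    return (t * math.pi / 2, 0.0)

length_euclid = curve_length(sphere, quarter_equator, 1.0, 0.0)  # w = 1, w* = 0
length_iso = curve_length(sphere, quarter_equator, 0.0, 1.0)     # purely isophotic
```

On a sphere of radius R the Gaussian image of a curve is the curve scaled by 1/R,
so the two lengths should be close to R*pi/2 and pi/2, respectively.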



Remark 1. We may associate with the surface x(u, v) the 2-dimensional surface
$X(u, v) = (\sqrt{w}\,x(u, v), \sqrt{w^*}\,n(u, v)) \subset \mathbb{R}^6$. Then the canonical Euclidean metric
in $\mathbb{R}^6$ induces on the manifold X exactly the regularized isophotic metric; its
first fundamental form agrees with (9). In this sense, the isophotic metric has 
some relation to work on image manifolds, if we consider the unit normals as a 
vector valued image on the surface [11]. 



2.2 Computation on Implicit Surfaces 

In view of the increasing importance of implicit representations and the elegance 
of the level set method for the solution of a variety of problems in geometric 
computing [1,15,23], it is appropriate to address the computation of the isophotic 
metric if we are given an implicit representation F(x) = 0 of the surface. There
is nothing to do for the Euclidean metric. We simply use the canonical Euclidean
metric in $\mathbb{R}^3$, described by the identity matrix $E = (\delta_{ij})$. The restriction to any
level set surface $\Phi_c: F(x) = c = \mathrm{const}$ is the metric on the surface.

We are now constructing another metric in $\mathbb{R}^3$ whose restriction to F(x) = 0
is the desired isophotic metric. For any $x \in \mathbb{R}^3$ in the domain where F is defined,
the normalized gradient vector $n(x) = \nabla F / \|\nabla F\|$ describes the unit normal of
the level set of F which passes through x. Thus, the mapping $x \mapsto n(x)$ extends
the Gaussian mapping to the set of all level sets of F. The image lies on the unit
sphere. The first derivative of this extended Gaussian mapping has the (singular)
matrix $J := (n_x, n_y, n_z)$. Hence, the squared (purely) isophotic length $\|v\|_i^2$ of
a vector v (tangent vector of $\mathbb{R}^3$ at x) is

$\|v\|_i^2 = (J \cdot v)^2 = v^T \cdot N \cdot v$,    (10)

where the matrix $N = (n_{ij}) = J^T \cdot J$ is the Gramian of the partial derivatives of
n. Finally, the matrix

$M = wE + w^*N = (w\delta_{ij} + w^* n_{ij})$    (11)

describes a Riemannian metric in $\mathbb{R}^3$ whose restriction to any level set surface
$\Phi_c$ of F is the corresponding regularized isophotic metric on $\Phi_c$.

Note that the expressions become particularly simple for a signed distance 
function F, since it satisfies $\|\nabla F\| = 1$. Moreover, it is well-known how to
efficiently compute the signed distance function to a surface, even if it is given 
just as a cloud of points [24]. Therefore, the implicit framework can be used to 
perform computations basically directly on clouds of measurement points. 
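
A minimal numerical sketch of (10) and (11), assuming F is available as a Python
function: the extended Gaussian map and its Jacobian J are estimated by central
differences. The signed distance function of a sphere, the step sizes and the
function names are illustrative assumptions, not part of the paper.

```python
import math

def unit_gradient(F, p, h=1e-5):
    """n(x) = grad F / ||grad F||, with a central-difference gradient."""
    g = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += h
        lo[i] -= h
        g.append((F(hi) - F(lo)) / (2 * h))
    length = math.sqrt(sum(c * c for c in g))
    return [c / length for c in g]

def metric_matrix(F, p, w=1.0, w_star=1.0, h=1e-4):
    """M = w*E + w_star*N with N = J^T J, where J is the Jacobian of x -> n(x)."""
    cols = []  # cols[j] holds the partial derivative of n with respect to x_j
    for j in range(3):
        hi, lo = list(p), list(p)
        hi[j] += h
        lo[j] -= h
        n_hi, n_lo = unit_gradient(F, hi), unit_gradient(F, lo)
        cols.append([(a - b) / (2 * h) for a, b in zip(n_hi, n_lo)])
    M = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            n_ij = sum(cols[i][k] * cols[j][k] for k in range(3))
            M[i][j] = w * (1.0 if i == j else 0.0) + w_star * n_ij
    return M

def squared_length(M, v):
    """v^T M v, the squared length of v in the metric M."""
    return sum(v[i] * M[i][j] * v[j] for i in range(3) for j in range(3))

def sphere_sdf(x):
    """Signed distance to a sphere of radius 2 centered at the origin (assumed test case)."""
    return math.sqrt(x[0] ** 2 + x[1] ** 2 + x[2] ** 2) - 2.0

M = metric_matrix(sphere_sdf, [0.0, 0.0, 2.0])
tangent_sq = squared_length(M, [1.0, 0.0, 0.0])  # direction inside the level set
normal_sq = squared_length(M, [0.0, 0.0, 1.0])   # direction across level sets
```

For this sphere the normal turns at rate 1/2 along tangent directions and not at
all along the normal direction, so with w = w* = 1 the values should be close to
1 + 1/4 and 1.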

3 Distance Fields in the Isophotic Metric 

A distance function d on a surface $\Phi$ is characterized by the Eikonal equation

$\|\nabla_\Phi d\|^2 = 1$,    (12)

where $\nabla_\Phi d$ is the surface gradient of d. $\nabla_\Phi d$ is a tangential vector of the surface,
points in direction of the largest positive directional derivative of d, and its norm
is equal to this derivative. For a parametric representation x(u, v) of $\Phi$ with first
fundamental matrix I, we can express this equation in terms of the ordinary
gradient,

$(\nabla d)^T \cdot I^{-1} \cdot \nabla d = 1$.

Here d = d(u, v) is the representation of the distance function in the parameter
domain, so that d(u, v) equals the distance value d(x(u, v)) of the surface point
x(u, v). Moreover, $\nabla d = (d_u, d_v)$ is the ordinary gradient of the bivariate function
d. For a distance field in the isophotic metric, we just replace the matrix I by
the matrix M from equation (9),

$(\nabla d)^T \cdot M^{-1} \cdot \nabla d = 1$.    (13)

This is a 2D Hamilton-Jacobi equation and therefore the numerical computation
of an isophotic distance field to some point set can be done with the fast sweeping
algorithm by Tsai et al. [25]. The examples in Fig. 4 have been computed in this
way. One can show that the computation of isophotic distance fields on implicitly
defined surfaces can proceed along the lines of [14]: With M from (11) we solve
the 3D Hamilton-Jacobi equation,

$(\nabla d)^T \cdot M^{-1} \cdot \nabla d = 1$,    (14)

in a small neighborhood of the surface. Here, a 3D extension [7] of the algorithm
by Tsai et al. [25] can be used.
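
To illustrate how a distance field depends on the metric, the following sketch
replaces the fast sweeping solver by a much cruder stand-in: Dijkstra's algorithm
on an 8-connected grid in the parameter domain, with edge lengths measured by M.
This is not the method of [25] and only approximates the solution of (13). The
test surface x(u, v) = (u, v, h(u)) with a smoothed edge near u = 0.5, the width
EPS and all names are assumptions made for this example.

```python
import heapq
import math

def distance_field(metric, nu, nv, du, dv, sources):
    """Graph distances on an nu x nv grid; edge length = sqrt(e^T M e), M at the edge midpoint."""
    dist = {(i, j): math.inf for i in range(nu) for j in range(nv)}
    heap = []
    for s in sources:
        dist[s] = 0.0
        heapq.heappush(heap, (0.0, s))
    steps = [(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1) if (a, b) != (0, 0)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if d > dist[(i, j)]:
            continue  # stale heap entry
        for a, b in steps:
            k, l = i + a, j + b
            if not (0 <= k < nu and 0 <= l < nv):
                continue
            m11, m12, m22 = metric((i + k) / 2 * du, (j + l) / 2 * dv)
            eu, ev = a * du, b * dv
            cand = d + math.sqrt(m11 * eu * eu + 2 * m12 * eu * ev + m22 * ev * ev)
            if cand < dist[(k, l)]:
                dist[(k, l)] = cand
                heapq.heappush(heap, (cand, (k, l)))
    return dist

EPS = 0.05  # assumed width of the smoothed edge

def ridge_metric(w, w_star):
    """Entries (m11, m12, m22) of M for x(u, v) = (u, v, h(u)) with slope angle theta(u)."""
    def metric(u, v):
        s = (u - 0.5) / EPS
        theta = (math.pi / 8) * (1 + math.tanh(s))        # slope angle, h'(u) = tan(theta)
        dtheta = (math.pi / 8) / EPS / math.cosh(s) ** 2  # turning rate of the normal
        # I = diag(1 / cos^2(theta), 1) and III = diag(dtheta^2, 0) for this surface.
        return (w / math.cos(theta) ** 2 + w_star * dtheta ** 2, 0.0, w)
    return metric

nu, nv = 41, 11
du, dv = 1.0 / (nu - 1), 0.1
d_euclid = distance_field(ridge_metric(1.0, 0.0), nu, nv, du, dv, [(0, 0)])
d_iso = distance_field(ridge_metric(1.0, 4.0), nu, nv, du, dv, [(0, 0)])
```

In this toy setting the distance across the edge (from u = 0 to u = 1) is clearly
larger in the isophotic metric, while distances parallel to the edge coincide,
which mirrors the accumulation of level sets at features described for Fig. 4.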

4 Application to Feature Sensitive Morphology on Surfaces

4.1 Continuous Morphology 

Let us consider a black image on a white surface. On the surface we have intro-
duced a metric. In our case this is the isophotic metric, but it could be another
one as well.

H. Pottmann et al.

Fig. 4. Level sets to uniformly spaced values of the distance field to a given region in
the Euclidean metric (left) and isophotic metric (right). In the isophotic metric, the
level sets accumulate at features.

Then, the distance field to the black part B possesses level sets which
are the boundaries of the dilated versions of B. Thus, dilation means growth with
help of the distance field (see Fig. 4). Likewise, erosion can be defined as dilation
of the white background, again with the distance field. Combinations of dilation
and erosion, which yield closing and opening, are straightforward. Furthermore,
the extension to labelled meshes, in which faces are assigned values from a small
set V, is relatively straightforward through the use of series closings on indexed
partitions as defined in [8]. For the use of the isophotic metric in feature sensitive
morphology, we should note the following effects:

— Applying a dilation with high isophotic part ($w^* \gg w$) to a domain adjacent
to a feature will make distances across that feature very large and thus avoid
a flow across the feature (see Fig. 4, right).

— Application of a dilation with high Euclidean part ($w \gg w^*$) to a domain
along a feature will fill interruptions along the feature, but not significantly
enlarge the domain across that feature (Fig. 5, left).

— A closing operation of a thin domain along a feature is achieved by applying 
to it first a dilation with high Euclidean component and then an erosion 
with high isophotic part (Fig. 5). 
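
The distance-based definitions of dilation, erosion and closing can be sketched
on a finite point set as follows. The sketch takes the distance as a function
argument, so a Euclidean-dominated distance could be used for the dilation and an
isophotic-dominated one for the erosion; for brevity the example uses the planar
Euclidean distance for both. The point grid, the radius and the function names
are assumptions made for illustration.

```python
import math

def dilate(black, points, dist, r):
    """All points whose distance to the black set is at most r."""
    return {p for p in points if any(dist(p, q) <= r for q in black)}

def erode(black, points, dist, r):
    """Erosion of the black set, defined as dilation of the white background."""
    white = points - black
    return points - dilate(white, points, dist, r)

def closing(black, points, dist_dilate, dist_erode, r):
    """Dilation followed by erosion; the two steps may use different metrics."""
    return erode(dilate(black, points, dist_dilate, r), points, dist_erode, r)

def euclid(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# A thin horizontal domain with an interruption at x = 4.
points = {(x, y) for x in range(10) for y in range(5)}
black = {(x, 2) for x in range(10) if x != 4}

closed = closing(black, points, euclid, euclid, 1.5)
```

Here `closed` should be the complete row y = 2: the dilation bridges the
interruption and the subsequent erosion restores the original width.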



4.2 Discrete Morphology 

We split the discussion into two parts. At first we discuss local neighborhoods, 
to be understood as positions of the structuring element. Secondly, we show 
how to use the neighborhoods - independently from their creation - in the 
formulation of morphological operators. In both cases we confine ourselves to 
triangle meshes, but the extension to other cell arrangements, even for manifolds 
of higher dimension, is rather straightforward.

Fig. 5. Continuous morphology: Closing of a domain (black) along a feature: First, a
Euclidean dilation (distance field on left side) is applied (result of dilation white on
right side), then an erosion with high isophotic part yields the closed domain (black,
right).

Neighborhoods. The combinatorial neighborhood $N_1(\Delta_i)$ of depth one to a
triangle $\Delta_i$ consists of all triangles in the triangulation $\Delta$ which share at
least one vertex with $\Delta_i$. The neighborhood $N_k$ is defined by iterating the
procedure: in step k we add all triangles which share at least a vertex with the
boundary of $N_{k-1}$. For
a nearly uniform triangulation, the neighborhoods $N_k$ are good approximants
to geodesic circles. For a neighborhood in the isophotic metric one has to gather
triangles around $\Delta_i$, whose isophotic distance falls below a given threshold. We
have implemented the computation of these geodesic discs following an idea by
M. Reimers [18], which appears for grids already in [24]. In view of Remark 1 we
compute a Euclidean distance field to a triangle on a triangle mesh in $\mathbb{R}^6$, which
represents a two-dimensional surface. The only difference to the work of [18] is
the dimension of ambient space, which is irrelevant for distance computations in
the mesh. The examples in Figs. 2, 6, 7 have been computed in this way.



Morphological Operators. Let us first describe the dilation of level k of 
black elements on a white background. At each triangle $\Delta_i$ of the triangulation
we compute the local neighborhood $N_k(\Delta_i)$ and set the color of $\Delta_i$ to black if
at least one of the triangles in the neighborhood is black. If the neighborhoods 
approximate geodesic discs sufficiently well in some metric (e.g. the isophotic 
metric), we have the following counterpart to the planar case: performing k 
times a dilation of level one is the same as performing once a dilation of level 
k. If we use a structuring element (SE) based on the isophotic distance, then it 
is inevitable that there will be some triangles which lie partly within and partly 
outside the isophotic distance threshold. A simple solution to this problem would 
be to assign the triangle to the SE if more than a specified proportion of its 
surface area is inside the distance boundary. A more flexible approach would 
be to make use of non-flat SEs [22, p. 441], having values influenced by the 
proportion of a triangle lying within the distance threshold.

Fig. 6. Discrete morphology: Dilation in the isophotic metric (9) with w = 0.5 and
w* = 0.5: Starting with the dark triangles (left) we get the result shown (right). Note
that the isophotic metric prevents a flow across features.



An erosion of level k of black parts is just a dilation of level k applied to 
the white background. A morphological closing operation first applies a dilation 
of level k, and then an erosion of level k. This fills holes. The opening operator 
applies the erosion before the dilation, which removes thin connections between 
more compact parts. The width of the bridges to be removed is related to k. 
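
The discrete operators can be sketched with combinatorial neighborhoods as
follows. This is an illustrative sketch only: it uses the vertex-sharing
neighborhoods $N_k$ rather than isophotic distance thresholds, and the triangle
strip, the index sets and the function names are assumptions introduced here.

```python
def vertex_neighbors(triangles):
    """N_1 for every triangle: all triangles sharing at least one vertex (itself included)."""
    by_vertex = {}
    for t, tri in enumerate(triangles):
        for v in tri:
            by_vertex.setdefault(v, set()).add(t)
    return [set().union(*(by_vertex[v] for v in tri)) for tri in triangles]

def neighborhood(n1, t, k):
    """N_k(t), obtained by growing the region k times with N_1."""
    region = {t}
    for _ in range(k):
        region = set().union(*(n1[s] for s in region))
    return region

def dilate(black, n1, k):
    """A triangle becomes black if its neighborhood N_k contains a black triangle."""
    return {t for t in range(len(n1)) if neighborhood(n1, t, k) & black}

def erode(black, n1, k):
    """Erosion of the black set as dilation of the white background."""
    everything = set(range(len(n1)))
    return everything - dilate(everything - black, n1, k)

def closing(black, n1, k):
    return erode(dilate(black, n1, k), n1, k)

def opening(black, n1, k):
    return dilate(erode(black, n1, k), n1, k)

# A strip of 20 triangles; triangle i has the vertices i, i+1, i+2.
triangles = [(i, i + 1, i + 2) for i in range(20)]
n1 = vertex_neighbors(triangles)

twice_level_one = dilate(dilate({4}, n1, 1), n1, 1)
once_level_two = dilate({4}, n1, 2)
closed = closing({8, 9, 11, 12}, n1, 1)  # triangle 10 is a hole
opened = opening({5}, n1, 1)             # an isolated thin part
```

On this strip, two dilations of level one agree with one dilation of level two,
the closing fills the hole at triangle 10, and the opening removes the isolated
triangle.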

We present examples of discrete morphology on real 3D data. For this pur- 
pose we scanned an engineering object (Fig. 6) and a clay model (Fig. 7) with 
a Minolta VI-900 3D laser scanner, and then triangulated the obtained point 
clouds to produce the meshes shown in the respective figures. The example in 
Fig. 6 demonstrates that feature sensitive mathematical morphology can aid the 
segmentation of an object into its fundamental surfaces; this holds with respect 
to the definition of local neighborhoods for shape detection, the implementa- 
tion of region growing algorithms and the processing of the responses from local 
shape detection filters (images on surfaces). The example in Fig. 7 supports our 
expectation that morphology in the isophotic metric could be used for artistic 
effects which are in accordance with the geometry of the surface. Furthermore, 
the geodesic curves shown in Fig. 3 indicate the usability of the isophotic metric 
for feature sensitive curve design on surfaces, e.g. for patch layout in connection 
with high quality freeform surface fitting to clouds of measurement points. 



5 Conclusion and Future Research 

We have introduced and studied the isophotic metric, discussed some basic com- 
putational aspects, and presented examples on its application to feature sensi- 
tive morphology on surfaces and geometric design on surfaces. Both the efficient 
computation as well as the application to morphology require further studies. 
Promising extensions of the concept are feature sensitive design of energy mini- 
mizing splines in the sense of the isophotic metric, and robot path planning, both 
on surfaces. Another subject of ongoing and future research is a modification of 
the isophotic metric so that it serves as a tool for image processing in arbitrary 
dimensions. Here, we interpret a grey value image as a hypersurface, but use - in
accordance with the work of Koenderink and van Doorn [12] - isotropic rather
than Euclidean geometry in ambient space.

Fig. 7. Dilation of the dark triangles (left) on a triangulated surface: (middle) in the
Euclidean metric, (right) in the isophotic metric (9) with w = 0.2 and w* = 0.8.



Abstract. Recovery of three dimensional (3D) shape and motion of
non-static scenes from a monocular video sequence is important for ap- 
plications like robot navigation and human computer interaction. If every 
point in the scene randomly moves, it is impossible to recover the non- 
rigid shapes. In practice, many non-rigid objects, e.g. the human face 
under various expressions, deform with certain structures. Their shapes 
can be regarded as a weighted combination of certain shape bases. Shape 
and motion recovery under such situations has attracted much interest. 
Previous work on this problem [6,4,13] utilized only orthonormality con-
straints on the camera rotations (rotation constraints). This paper
proves that using only the rotation constraints results in ambiguous and 
invalid solutions. The ambiguity arises from the fact that the shape bases 
are not unique because their linear transformation is a new set of eligible 
bases. To eliminate the ambiguity, we propose a set of novel constraints, 
basis constraints, which uniquely determine the shape bases. We prove 
that, under the weak-perspective projection model, enforcing both the 
basis and the rotation constraints leads to a closed-form solution to the 
problem of non-rigid shape and motion recovery. The accuracy and ro-
bustness of our closed-form solution are evaluated quantitatively on syn-
thetic data and qualitatively on real video sequences.



1 Introduction 

Many years of work in structure from motion have led to significant successes in 
recovery of 3D shapes and motion estimates from 2D monocular videos. Reliable 
systems exist for reconstruction of static scenes. However, most natural scenes 
are dynamic and non-rigid: expressive faces, people walking beside buildings, etc. 
Recovering the structure and motion of these non-rigid objects is a challenging 
task. The effects of 3D rotation and translation and non-rigid deformation are 
coupled together in image measurement. While it is impossible to reconstruct 
the shape if the scene deforms arbitrarily, in practice, many non-rigid objects, 
e.g. the human face under various expressions, deform with a class of structures. 

One class of solutions models non-rigid object shapes as weighted combina-
tions of certain shape bases that are pre-learned by off-line training [2, 3, 5, 9].
For instance, the geometry of a face is represented as a weighted combination of
shape bases that correspond to various facial deformations. Then the recovery
of shape and motion is simply a model fitting problem. However, in many ap- 
plications, e.g. reconstruction of a scene consisting of a moving car and a static 
building, the shape bases of the dynamic structure are difficult to obtain before 
reconstruction. 

Several approaches have been proposed to solve the problem without a prior 
model [6,13,4]. Instead, they treat the model, i.e. shape bases, as part of the 
unknowns to be solved. They try to recover not only the non-rigid shape and 
motion, but also the shape model. This class of approaches so far has utilized only 
the orthonormality constraints on camera rotations (rotation constraints) to
solve the problem. However, as shown in this paper, enforcing only the rotation 
constraints leads to ambiguous and invalid solutions. These approaches thus can- 
not guarantee the desired solution. They have to either require a priori knowledge 
on shape and motion, e.g. constant speed [10], or need non-linear optimization 
that involves a large number of variables and hence requires a good initial estimate
[13,4].

Intuitively, the above ambiguity arises from the non-uniqueness of the shape 
bases: a linear transformation of a set of shape bases is a new set of eligible 
bases. Once the bases are determined uniquely, the ambiguity is eliminated. 
Therefore, instead of imposing only the rotation constraints, we identify and 
introduce another set of constraints on the shape bases (basis constraints),
which implicitly determine the bases uniquely. This paper proves that, under the 
weak-perspective projection model, when both the basis and rotation constraints 
are imposed, a closed-form solution to the problem of non-rigid shape and motion 
recovery is achieved. Accordingly we develop a factorization method that applies 
both metric constraints to compute the closed-form solution for the non-rigid 
shape, motion, and shape bases. 



2 Previous Work 

Recovering 3D object structure and motion from 2D image sequences has a rich 
history. Various approaches have been proposed for different applications. The 
discussion in this section will focus on the factorization techniques, which are 
most closely related to our work. 

The factorization method was first proposed by Tomasi and Kanade [12]. 
First it applies the rank constraint to factorize a set of feature locations tracked 
across the entire sequence. Then it uses the orthonormality constraints on the 
rotation matrices to recover the scene structure and camera rotations in one 
step. This approach works under the orthographic projection model. Poelman 
and Kanade [11] extended it to work under the weak perspective and para- 
perspective projection models. Triggs [14] generalized the factorization method 
to the recovery of scene geometry and camera motion under the perspective 
projection model. These methods work for static scenes. 

Costeira and Kanade [8] extended the factorization technique to recover the 
structure of multiple independently moving objects. This method factorizes the
image locations of certain features to separate different objects and then in-
dividually recovers their shapes. Wolf and Shashua [16] derived a geometrical
constraint, called the segmentation matrix, to reconstruct a scene containing 
two independently moving objects from two perspective views. Vidal and his 
colleagues [15] extended this approach for dynamic scenes containing multiple 
independently moving objects. For reconstruction of dynamic scenes consist- 
ing of both static objects and objects moving along fixed directions, Han and 
Kanade [10] proposed a factorization-based method that achieves a unique solu- 
tion with the assumption of constant velocities. A more generalized solution to 
reconstructing the shapes that deform at constant velocity is presented in [17]. 

Bregler and his colleagues [6] first introduced the basis representation of 
non-rigid shapes to embed the deformation constraints into the scene struc- 
ture. By analyzing the low rank of the image measurements, they proposed a 
factorization-based method that enforces the orthonormality constraints on cam- 
era rotations to reconstruct the non-rigid shape and motion. Torresani and his 
colleagues [13] extended the method in [6] to a trilinear optimization approach. 
At each step, two of the three types of unknowns, bases, coefficients, and ro- 
tations, are fixed and the remaining one is updated. The method in [6] is used 
to initialize the optimization process. Brand [4] proposed a similar non-linear 
optimization method that uses an extension of the method in [6] for initializa- 
tion. All three methods enforce only the rotation constraints and thus cannot 
guarantee an optimal solution. Note that both non-linear optimization meth- 
ods involve a large number of variables, e.g. the number of unknown coefficients 
equals the product of the number of images and the number of shape bases. The 
performance relies on the quality of the initial estimate of the unknowns. 

3 Problem Statement 

Given 2D locations of P feature points across F frames,
$\{(u, v)^T_{fp} \mid f = 1, \dots, F,\ p = 1, \dots, P\}$, our goal is to recover
the motion of the non-rigid object relative to the camera, including rotations
$\{R_f \mid f = 1, \dots, F\}$ and translations $\{t_f \mid f = 1, \dots, F\}$,
and its 3D deforming shapes
$\{(x, y, z)^T_{fp} \mid f = 1, \dots, F,\ p = 1, \dots, P\}$. Throughout
this paper, we assume:

— the deforming shapes can be represented as weighted combinations of shape bases; 

— the 3D structure and the camera motion are non-degenerate; 

— the camera projection model is the weak-perspective projection model. 

We follow the representation of [3,6]. The non-rigid shapes are represented as
weighted combinations of K shape bases $\{B_i, i = 1, \dots, K\}$. The bases are
3 x P matrices controlling the deformation of P points. Then the 3D coordinate
of the point p at the frame f is

$$X_{fp} = (x, y, z)^T_{fp} = \sum_{i=1}^{K} c_{fi} b_{ip}, \quad f = 1, \dots, F,\ p = 1, \dots, P \tag{1}$$

where $b_{ip}$ is the pth column of $B_i$ and $c_{fi}$ is its combination coefficient at the
frame f. The image coordinate of $X_{fp}$ under the weak perspective projection
model is

$$x_{fp} = (u, v)^T_{fp} = s_f (R_f \cdot X_{fp} + t_f) \tag{2}$$

where $R_f$ stands for the first two rows of the fth camera rotation and
$t_f = [t_{fx}\ t_{fy}]^T$ is its translation relative to the world origin. $s_f$ is the
scalar of the weak perspective projection.

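Replacing $X_{fp}$ using Eq. (1) and absorbing $s_f$ into $c_{fi}$ and $t_f$, we have

$$x_{fp} = \begin{pmatrix} c_{f1} R_f & \cdots & c_{fK} R_f \end{pmatrix} \cdot \begin{pmatrix} b_{1p} \\ \vdots \\ b_{Kp} \end{pmatrix} + t_f \tag{3}$$

Suppose the image coordinates of all P feature points across F frames are
obtained. We form a 2F x P measurement matrix W by stacking all image
coordinates. Then $W = MB + T[1\ 1\ \dots\ 1]$, where M is a 2F x 3K scaled rotation
matrix, B is a 3K x P bases matrix, and T is a 2F x 1 translation vector,

$$M = \begin{pmatrix} c_{11} R_1 & \cdots & c_{1K} R_1 \\ \vdots & \ddots & \vdots \\ c_{F1} R_F & \cdots & c_{FK} R_F \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & \cdots & b_{1P} \\ \vdots & \ddots & \vdots \\ b_{K1} & \cdots & b_{KP} \end{pmatrix} \tag{4}$$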

As in [10,6], we position the world origin at the scene center and compute
the translation vector by averaging the image projections of all points. We then
subtract it from W and obtain the registered measurement matrix $\tilde{W} = MB$.

Since $\tilde{W}$ is the product of the 2F x 3K scaled rotation matrix M and the
3K x P shape bases matrix B, its rank is at most min{3K, 2F, P}. In practice,
the frame number F and point number P are usually much larger than the basis
number K. Thus under the non-degenerate cases, the rank of $\tilde{W}$ is 3K and K
is determined by $K = \mathrm{rank}(\tilde{W})/3$. We then perform SVD on $\tilde{W}$ to get the
best possible rank 3K approximation of $\tilde{W}$ as $\tilde{M}\tilde{B}$. This decomposition is only
determined up to a non-singular 3K x 3K linear transformation. The true scaled
rotation matrix M and bases matrix B are of the form,

$$M = \tilde{M} \cdot G, \quad B = G^{-1} \cdot \tilde{B} \tag{5}$$

where G is called the corrective transformation matrix. Once G is determined, 
M and B are obtained and thus the rotations, shape bases, and combination 
coefficients are recovered. 

All the procedures above, except obtaining G, are standard and well-under- 
stood [3,6]. The problem of nonrigid shape and motion recovery is now reduced 
to: given the measurement matrix W, how can we compute the corrective trans- 
formation matrix G? 
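The registration and factorization steps, and the ambiguity they leave, can be illustrated as follows. This is a sketch under stated assumptions: NumPy is used, M is a random stand-in without the block structure of Eq. (4) (the ambiguity does not depend on it), the bases are centered so that the world origin sits at the scene center, and the split of the singular values between the two factors is one common choice rather than something the text prescribes.

```python
import numpy as np

rng = np.random.default_rng(1)
F, P, K = 12, 25, 2                           # illustrative sizes
M = rng.standard_normal((2 * F, 3 * K))       # stand-in for the scaled rotations
B = rng.standard_normal((3 * K, P))
B -= B.mean(axis=1, keepdims=True)            # world origin at the scene center
T = rng.standard_normal((2 * F, 1))
W = M @ B + T                                 # broadcasting adds T to every column

# Registration: the row means recover T, and subtracting them leaves M B.
T_est = W.mean(axis=1, keepdims=True)
W_reg = W - T_est

# Rank-3K factorization by SVD; K follows from the numerical rank.
U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)
r = int(np.sum(s > 1e-9 * s[0]))
K_est = r // 3
M_tilde = U[:, :r] * np.sqrt(s[:r])
B_tilde = np.sqrt(s[:r])[:, None] * Vt[:r]

# Any non-singular r x r matrix gives another valid factorization.
G_any = np.eye(r) + 0.1 * rng.standard_normal((r, r))
M_alt = M_tilde @ G_any
B_alt = np.linalg.solve(G_any, B_tilde)

# The corrective transformation of Eq. (5), computed here from the ground
# truth only to show that it exists; in practice it is the unknown.
G, *_ = np.linalg.lstsq(M_tilde, M, rcond=None)

print(K_est, np.allclose(M_alt @ B_alt, W_reg), np.allclose(M_tilde @ G, M))
```

Because `M_alt @ B_alt` reproduces the registered matrix just as well as the true factors do, the measurements alone cannot single out G; the metric constraints of the next section are what select it.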



4 Metric Constraints 

To compute G, two types of metric constraints are available and should be 
imposed: rotation constraints and basis constraints. While using only the 
rotation constraints [6,4] leads to ambiguous and invalid solutions, enforcing 
both sets of constraints results in a closed-form solution. 

4.1 Rotation Constraints 



The orthonormality constraints on the rotation matrices are one of the most 
powerful metric constraints and they have been used in reconstructing the shape 
and motion for static objects [12,11], multiple moving objects [8,10], and non- 
rigid deforming objects [6,13,4]. 

According to Eq. (5), $MM^T = \tilde{M} G G^T \tilde{M}^T$. Let us denote $GG^T$ by Q. Then,

$$\tilde{M}_{2*i-1:2*i}\, Q\, \tilde{M}^T_{2*j-1:2*j} = \sum_{k=1}^{K} c_{ik} c_{jk} R_i R_j^T, \quad i, j = 1, \dots, F \tag{6}$$

where $\tilde{M}_{2*i-1:2*i}$ represents the ith two-row of $\tilde{M}$. Due to orthonormality of
rotation matrices,

$$\tilde{M}_{2*i-1:2*i}\, Q\, \tilde{M}^T_{2*i-1:2*i} = \sum_{k=1}^{K} c_{ik}^2 I_{2 \times 2}, \quad i = 1, \dots, F \tag{7}$$

where $I_{2 \times 2}$ is a 2 x 2 identity matrix. Because Q is symmetric, the number of
unknowns in Q is $(9K^2 + 3K)/2$. Each diagonal block of $MM^T$ yields two linear
constraints on Q,

$$\tilde{M}_{2*i-1}\, Q\, \tilde{M}^T_{2*i-1} = \tilde{M}_{2*i}\, Q\, \tilde{M}^T_{2*i} \tag{8}$$

$$\tilde{M}_{2*i-1}\, Q\, \tilde{M}^T_{2*i} = 0 \tag{9}$$

For F frames, we have 2F linear constraints on $(9K^2 + 3K)/2$ unknowns. It appears
that, when we have enough images, i.e. $F \geq (9K^2 + 3K)/4$, there should be enough
constraints to compute Q via the least-square methods. However, it is not true
in general. We will show that most of these rotation constraints are redundant
and they are inherently insufficient to determine Q.
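The counting argument above is easy to tabulate. The sketch below uses plain Python with illustrative function names; it only reproduces the naive count of unknowns against constraints and says nothing about whether the constraints are independent.

```python
def num_unknowns(K: int) -> int:
    """Independent entries of the symmetric 3K x 3K matrix Q: (9K^2 + 3K) / 2."""
    n = 3 * K
    return n * (n + 1) // 2

def frames_needed_by_counting(K: int) -> int:
    """Smallest F for which the 2F rotation constraints match the unknowns."""
    return (num_unknowns(K) + 1) // 2

for K in (1, 2, 3):
    print(K, num_unknowns(K), frames_needed_by_counting(K))
```

For K = 2 this gives 21 unknowns and 11 frames by counting alone, which is the estimate that the following subsection shows to be misleading.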



4.2 Why Are Rotation Constraints Not Sufficient? 

When the scene is static or deforms at constant velocities, the rotation con- 
straints are sufficient to solve the corrective transformation matrix G [12,10]. 
However, when the scene deforms at varying speed, no matter how many images 
are given or how many feature points are tracked, the solutions of the constraints 
in Eq. (8) and Eq. (9) are inherently ambiguous. 

Definition 1. A 3K x 3K symmetric matrix Y is called a block-skew-symmetric
matrix, if all the diagonal 3 x 3 blocks are zero matrices and each off-diagonal
3 x 3 block is a skew symmetric matrix.

$$Y_{ij} = \begin{pmatrix} 0 & y_{ij1} & y_{ij2} \\ -y_{ij1} & 0 & y_{ij3} \\ -y_{ij2} & -y_{ij3} & 0 \end{pmatrix} = -Y_{ij}^T = Y_{ji}^T, \quad i \neq j \tag{10}$$

$$Y_{ii} = 0_{3 \times 3}, \quad i, j = 1, \dots, K \tag{11}$$

Each off-diagonal block consists of 3 independent elements. Because Y is
symmetric and has K(K - 1)/2 independent off-diagonal blocks, it includes
3K(K - 1)/2 independent elements.

Definition 2. A 3K x 3K symmetric matrix Z is called a block-scaled-identity
matrix, if each 3 x 3 block is a scaled identity matrix, i.e. $Z_{ij} = \lambda_{ij} I_{3 \times 3}$, where
$\lambda_{ij}$ is the only variable.

Because Z is symmetric, the total number of variables in Z equals the number
of independent blocks, K(K + 1)/2.

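Theorem 1. The general solution of the rotation constraints in Eq. (8) and
Eq. (9) can be expressed as $Q = GHG^T$, where G is the desired corrective
transformation matrix, and $H = Y + Z$, with Y a block-skew-symmetric matrix, and
Z a block-scaled-identity matrix.

Proof. The solution Q of Eq. (8) and Eq. (9) can be represented as $G \Lambda G^T$, since
G is a non-singular square matrix. Now we need to prove that $\Lambda$ must be in the
form of H, i.e. the summation of Y and Z.

According to Eq. (7),

$$\tilde{M}_{2*i-1:2*i}\, Q\, \tilde{M}^T_{2*i-1:2*i} = M_{2*i-1:2*i}\, \Lambda\, M^T_{2*i-1:2*i} = \alpha_i I_{2 \times 2}, \quad i = 1, \dots, F \tag{12}$$

where $\alpha_i$ is an unknown scalar depending on only the coefficients. Divide $\Lambda$ into
3 x 3 blocks, $\Lambda_{kj}$ (k, j = 1, ..., K). Combining Eq. (4) and (12), we have

$$R_i \sum_{k=1}^{K} \Big( c_{ik}^2 \Lambda_{kk} + \sum_{j=k+1}^{K} c_{ik} c_{ij} (\Lambda_{kj} + \Lambda_{kj}^T) \Big) R_i^T = \alpha_i I_{2 \times 2}, \quad i = 1, \dots, F \tag{13}$$

Denote the 3 x 3 symmetric matrix
$\sum_{k=1}^{K} \big( c_{ik}^2 \Lambda_{kk} + \sum_{j=k+1}^{K} c_{ik} c_{ij} (\Lambda_{kj} + \Lambda_{kj}^T) \big)$ by
$\Gamma_i$. Let $\tilde{\Gamma}_i$ be the homogeneous solution of Eq. (13), i.e. $R_i \tilde{\Gamma}_i R_i^T = 0_{2 \times 2}$. Since
$R_i$ consists of the first two rows of the ith rotation matrix, let $r_{i3}$ denote the
third row. Due to orthonormality of $R_i$,

$$\tilde{\Gamma}_i = r_{i3}^T \delta_i + \delta_i^T r_{i3} \tag{14}$$

where $\delta_i$ is an arbitrary 1 x 3 vector. Apparently $\Gamma_i = \alpha_i I_{3 \times 3}$ is a particular
solution of Eq. (13). Therefore the general solution of Eq. (13) is

$$\Gamma_i = \sum_{k=1}^{K} \Big( c_{ik}^2 \Lambda_{kk} + \sum_{j=k+1}^{K} c_{ik} c_{ij} (\Lambda_{kj} + \Lambda_{kj}^T) \Big) = \alpha_i I_{3 \times 3} + \beta_i \tilde{\Gamma}_i \tag{15}$$

where $\beta_i$ is a scalar. Now let us prove $\beta_i \tilde{\Gamma}_i$ has to be zero. Because $Q = G \Lambda G^T$
is the general solution on all images, Eq. (15) must be satisfied for any set of
the coefficients and rotations. For any two frames i and j that are formed by
the same 3D shapes, i.e. same coefficients, but different rotations $R_i$ and $R_j$,
according to Eq. (15), we have

$$\alpha_i I_{3 \times 3} + \beta_i \tilde{\Gamma}_i = \alpha_i I_{3 \times 3} + \beta_j \tilde{\Gamma}_j \;\Rightarrow\; \beta_i \tilde{\Gamma}_i - \beta_j \tilde{\Gamma}_j = 0_{3 \times 3} \;\Rightarrow\; R_j (\beta_i \tilde{\Gamma}_i - \beta_j \tilde{\Gamma}_j) R_j^T = 0_{2 \times 2} \tag{16}$$

According to Eq. (14), we have $R_j \tilde{\Gamma}_j R_j^T = 0_{2 \times 2}$, thus

$$R_j (\beta_i \tilde{\Gamma}_i) R_j^T = 0_{2 \times 2} \tag{17}$$

Because $R_j$ can be any rotation matrix, $\beta_i \tilde{\Gamma}_i$ has to be zero for any frame.
Therefore,

$$\sum_{k=1}^{K} \Big( c_{ik}^2 \Lambda_{kk} + \sum_{j=k+1}^{K} c_{ik} c_{ij} (\Lambda_{kj} + \Lambda_{kj}^T) \Big) = \alpha_i I_{3 \times 3} \tag{18}$$

Because Eq. (18) must be satisfied for any set of the coefficients, the solution is

$$\Lambda_{kk} = \lambda_{kk} I_{3 \times 3} \tag{19}$$

$$\Lambda_{kj} + \Lambda_{kj}^T = \lambda_{kj} I_{3 \times 3}, \quad k = 1, \dots, K;\ j = k + 1, \dots, K \tag{20}$$

where $\lambda_{kk}$ and $\lambda_{kj}$ are arbitrary scalars. According to Eq. (19), the diagonal
block $\Lambda_{kk}$ is a scaled identity matrix. From Eq. (20),
$\Lambda_{kj} - \frac{\lambda_{kj}}{2} I_{3 \times 3} = -\big(\Lambda_{kj} - \frac{\lambda_{kj}}{2} I_{3 \times 3}\big)^T$, i.e.
$\Lambda_{kj} - \frac{\lambda_{kj}}{2} I_{3 \times 3}$ is skew-symmetric. Therefore the off-diagonal block
$\Lambda_{kj}$ equals the summation of a scaled identity block, $\frac{\lambda_{kj}}{2} I_{3 \times 3}$,
and a skew-symmetric block, $\Lambda_{kj} - \frac{\lambda_{kj}}{2} I_{3 \times 3}$. This statement concludes the proof:
$\Lambda$ equals H, the summation of a block-skew-symmetric matrix Y and a
block-scaled-identity matrix Z, i.e. the general solution of the rotation constraints is
$Q = GHG^T$. □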

Because H consists of $2K^2 - K$ independent elements: 3K(K - 1)/2 from Y
and K(K + 1)/2 from Z, the solution space has $2K^2 - K$ degrees of freedom.
It explains why the rotation constraints are sufficient in rigid cases (K = 1) but
lead to ambiguous solutions when the scene is non-rigid (K > 1). This conclusion
is also confirmed by our experiments. If every solution in the space is a valid
solution of Q, then even if the ambiguity exists, we can compute an arbitrary
solution in the space to solve the problem. However, the space contains many
invalid solutions. Specifically, since $Q = GG^T$ must be positive semi-definite,
when H is not positive semi-definite, the solutions $Q = GHG^T$ are not valid.
For example, when H only consists of a block-skew-symmetric matrix Y, the
solutions $Q = GYG^T$ are invalid because Y is not positive semi-definite.
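The size of this solution space can be examined numerically. The following sketch assumes NumPy and random synthetic rotations and coefficients; for simplicity it takes G to be the identity, so that M itself plays the role of the factorized matrix. The helpers `random_rotation` and `sym_row` are illustrative names, not part of the original method. It assembles the rotation constraints of Eq. (8) and Eq. (9) as a linear system in the upper-triangle entries of Q, measures the dimension of its null space, and checks that a block-skew-symmetric Y satisfies every constraint although it is indefinite.

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def sym_row(a, b, iu):
    """Coefficients of a @ Q @ b over the upper-triangle entries of symmetric Q."""
    S = np.outer(a, b) + np.outer(b, a)
    S[np.diag_indices_from(S)] /= 2.0
    return S[iu]

rng = np.random.default_rng(0)
F, K = 40, 2                                  # illustrative sizes
n = 3 * K
C = rng.standard_normal((F, K))               # varying deformation coefficients
M = np.vstack([np.hstack([C[f, k] * R2 for k in range(K)])
               for f, R2 in ((f, random_rotation(rng)[:2]) for f in range(F))])

iu = np.triu_indices(n)
A = []
for f in range(F):
    a, b = M[2 * f], M[2 * f + 1]
    A.append(sym_row(a, a, iu) - sym_row(b, b, iu))   # Eq. (8)
    A.append(sym_row(a, b, iu))                       # Eq. (9)
A = np.array(A)
nullity = A.shape[1] - np.linalg.matrix_rank(A)

# A block-skew-symmetric Y (Definition 1) built from one skew-symmetric block.
S3 = np.array([[0.0, 1.0, 2.0], [-1.0, 0.0, 3.0], [-2.0, -3.0, 0.0]])
Y = np.zeros((n, n))
Y[:3, 3:] = S3
Y[3:, :3] = S3.T
residual = np.abs(A @ Y[iu]).max()
min_eig = np.linalg.eigvalsh(Y).min()

print(nullity, 2 * K * K - K, residual, min_eig)
```

With 80 constraints on 21 unknowns the system looks heavily over-determined, yet its null space keeps the $2K^2 - K$ dimensions stated above, and Y lies in that null space while having a negative eigenvalue.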

4.3 Basis Constraints 

Are there other constraints that we can use to remove the ambiguity of the rota- 
tion constraints? For static scenes, a variety of approaches [12,11] utilize only the 
rotation constraints and succeed in determining the correct solution. Intuitively, 
the only difference between non-rigid and rigid situations is that the non-rigid 
shape is a weighted combination of certain shape bases. This observation sug- 
gests that the ambiguity is related to the basis representation. Can we impose 
constraints on the bases to eliminate the ambiguity? 

The shape bases are non-unique because any non-singular linear transformation
on them yields a new set of eligible bases. However, if we find K frames
including independent shapes and treat those shapes as a set of bases, the bases
are determined uniquely¹. We denote those frames as the first K images in the
sequence and the corresponding coefficients are

$$c_{ii} = 1, \quad i = 1, \dots, K$$

$$c_{ij} = 0, \quad i, j = 1, \dots, K,\ i \neq j \tag{21}$$

¹ We can find K frames in which the shapes are independent, by examining the singular
values of their image projections.

For any three-column of G, $g_k$, k = 1, ..., K, according to Eq. (5),

$$\tilde{M} g_k = \begin{pmatrix} c_{1k} R_1 \\ \vdots \\ c_{Fk} R_F \end{pmatrix}, \quad k = 1, \dots, K \tag{22}$$

We denote $g_k g_k^T$ by $Q_k$. Then,

$$\tilde{M}_{2*i-1:2*i}\, Q_k\, \tilde{M}^T_{2*j-1:2*j} = c_{ik} c_{jk} R_i R_j^T \tag{23}$$

Thus $Q_k$ satisfies the rotation constraints in Eq. (8) and Eq. (9). Besides,
combining Eq. (21) and Eq. (23), we obtain another 4(K - 1)F basis constraints on
$Q_k$,

$$\tilde{M}_{2*i-1}\, Q_k\, \tilde{M}^T_{2*j-1} = \begin{cases} 1, & i = j = k \\ 0, & (i, j) \in \omega_1 \end{cases} \tag{24}$$

$$\tilde{M}_{2*i}\, Q_k\, \tilde{M}^T_{2*j} = \begin{cases} 1, & i = j = k \\ 0, & (i, j) \in \omega_1 \end{cases} \tag{25}$$

$$\tilde{M}_{2*i-1}\, Q_k\, \tilde{M}^T_{2*j} = 0, \quad (i, j) \in \omega_1 \text{ or } i = j = k \tag{26}$$

$$\tilde{M}_{2*i}\, Q_k\, \tilde{M}^T_{2*j-1} = 0, \quad (i, j) \in \omega_1 \text{ or } i = j = k \tag{27}$$

where $\omega_1 = \{(i, j) \mid i = 1, \dots, K,\ j = 1, \dots, F, \text{ and } i \neq k\}$.



5 A Closed-Form Solution 

Due to Theorem 1, enforcing the rotation constraints on $Q_k$ leads to the
ambiguous solution $Q = GHG^T$. This section will prove that enforcing the basis
constraints eliminates the ambiguity on Q and determines a closed-form solution.
Note that we assume that the 3D structure and camera motion are both
non-degenerate, i.e. the rank of the registered measurement matrix $\tilde{W}$ is 3K.

By definition, each 3 x 3 block $H_{ij}$ (i, j = 1, ..., K) of H contains four
independent entries,

$$H_{ij} = \begin{pmatrix} h_1 & h_2 & h_3 \\ -h_2 & h_1 & h_4 \\ -h_3 & -h_4 & h_1 \end{pmatrix} \tag{28}$$

Lemma 1. Under non-degenerate situations, $H_{ij}$ is a zero matrix if

$$R_i H_{ij} R_j^T = \begin{pmatrix} r_{i1} \\ r_{i2} \end{pmatrix} H_{ij} \begin{pmatrix} r_{j1}^T & r_{j2}^T \end{pmatrix} = 0_{2 \times 2} \tag{29}$$

Proof. First we prove that the rank of $H_{ij}$ is at most 2. Due to the orthonormality
constraints,

$$H_{ij} = r_{i3}^T \delta_i + \delta_j^T r_{j3} \tag{30}$$

where $r_{i3} = r_{i1} \times r_{i2}$, $r_{j3} = r_{j1} \times r_{j2}$, and $\delta_i$ and $\delta_j$ are two arbitrary 1 x 3 vectors.
Each matrix on the right side of Eq. (30) is an outer product of two vectors and
so has rank at most 1. Thus the rank of $H_{ij}$ is at most 2.

Next, we prove $h_1 = 0$. Since the rank of $H_{ij}$ is less than its dimension, 3,
its determinant, $h_1 (h_1^2 + h_2^2 + h_3^2 + h_4^2)$, equals 0. Therefore $h_1$ must be 0 and
$H_{ij}$ is a skew-symmetric matrix.

We then prove $h_2 = h_3 = h_4 = 0$. Since $h_1 = 0$, we rewrite Eq. (29) as
follows:

$$\begin{pmatrix} r_{i1} \cdot (h \times r_{j1}) & r_{i1} \cdot (h \times r_{j2}) \\ r_{i2} \cdot (h \times r_{j1}) & r_{i2} \cdot (h \times r_{j2}) \end{pmatrix} = 0_{2 \times 2} \tag{31}$$

where $h = (-h_4\ \ h_3\ \ {-h_2})$. Eq. (31) means that the vector h is located in the
intersection of the four planes determined by $(r_{i1}, r_{j1})$, $(r_{i1}, r_{j2})$, $(r_{i2}, r_{j1})$, and
$(r_{i2}, r_{j2})$. Under non-degenerate situations, $r_{i1}$, $r_{i2}$, $r_{j1}$, and $r_{j2}$ do not lie in the
same plane, hence the four planes intersect at the origin, i.e.
$h = (-h_4\ \ h_3\ \ {-h_2}) = 0_{1 \times 3}$. Therefore $H_{ij}$ is a zero matrix. □



According to Lemma 1, we derive the following theorem,

Theorem 2. Enforcing both basis constraints and rotation constraints results
in a unique solution $Q = g_k g_k^T$, where $g_k$ is the kth three-column of G.

Proof. Due to Theorem 1, by enforcing the rotation constraints, we achieve the
solution $Q = GHG^T$. Thus $\tilde{M} Q \tilde{M}^T = M H M^T$, and

$$M_{2*i-1:2*i}\, H\, M^T_{2*j-1:2*j} = \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} c_{ik_1} c_{jk_2} R_i H_{k_1 k_2} R_j^T, \quad i, j = 1, \dots, F \tag{32}$$

According to Eq. (21),

$$M_{2*i-1:2*i}\, H\, M^T_{2*j-1:2*j} = R_i H_{ij} R_j^T, \quad i, j = 1, \dots, K \tag{33}$$

Due to the basis constraints in Eq. (24) to (27),

$$R_k H_{kk} R_k^T = I_{2 \times 2} \tag{34}$$

$$R_i H_{ij} R_j^T = 0_{2 \times 2}, \quad i, j = 1, \dots, K, \text{ and } i \neq k,\ j \neq k \tag{35}$$

By definition, $H_{kk} = \lambda_{kk} I_{3 \times 3}$, where $\lambda_{kk}$ is a scalar. Due to Eq. (34), $\lambda_{kk} = 1$
and $H_{kk} = I_{3 \times 3}$. From Lemma 1 and Eq. (35), $H_{ij}$ is a zero matrix when
i, j = 1, ..., K, and $i \neq k$, $j \neq k$. Thus
$Q = GHG^T = (g_1, \dots, g_K) H (g_1, \dots, g_K)^T = (0, \dots, 0, g_k, 0, \dots, 0)(g_1, \dots, g_K)^T = g_k g_k^T$. □



Now we have proved that, by enforcing both rotation and basis constraints, i.e.
solving Eq. (8) to (9) and (24) to (27) by the least square methods, a closed-form
solution, $Q = Q_k = g_k g_k^T$, k = 1, ..., K, is achieved. Then $g_k$, k = 1, ..., K
can be recovered by decomposing $Q_k$ via SVD. We project the $g_k$'s to the common
coordinate system and determine the corrective transformation $G = (g_1, \dots, g_K)$.
According to Eq. (5), we recover the shape bases $B = G^{-1} \tilde{B}$, the scaled rotation
matrix $M = \tilde{M} G$, and thus the rotations and coefficients.
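The linear step of this procedure can be sketched on noiseless synthetic data. The code below is an illustration under stated assumptions, not the authors' implementation: it uses NumPy, random rotations and bases, a first-K-frames coefficient block set to the identity as in Eq. (21), and hypothetical helper names (`random_rotation`, `sym_row`, `solve_Qk`). It parameterizes the symmetric $Q_k$ by its upper triangle, stacks Eq. (8), (9) and (24) to (27), solves by least squares, and compares the result with $g_k g_k^T$ computed from the ground truth. The later steps (decomposing $Q_k$ and aligning the $g_k$'s to a common coordinate system) are not covered.

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def sym_row(a, b, iu):
    """Coefficients of a @ Q @ b over the upper-triangle entries of symmetric Q."""
    S = np.outer(a, b) + np.outer(b, a)
    S[np.diag_indices_from(S)] /= 2.0
    return S[iu]

def solve_Qk(M_tilde, K, k):
    """Least-squares Q_k from Eq. (8)-(9) and Eq. (24)-(27); k is 0-based."""
    F, n = M_tilde.shape[0] // 2, 3 * K
    iu = np.triu_indices(n)
    A, rhs = [], []
    for i in range(F):                      # rotation constraints, all frames
        a, b = M_tilde[2 * i], M_tilde[2 * i + 1]
        A.append(sym_row(a, a, iu) - sym_row(b, b, iu)); rhs.append(0.0)
        A.append(sym_row(a, b, iu)); rhs.append(0.0)
    for i in range(K):                      # basis constraints, first K frames
        for j in ([k] if i == k else range(F)):
            for r, a in enumerate(M_tilde[2 * i:2 * i + 2]):
                for c, b in enumerate(M_tilde[2 * j:2 * j + 2]):
                    A.append(sym_row(a, b, iu))
                    rhs.append(1.0 if (i == k and r == c) else 0.0)
    q, *_ = np.linalg.lstsq(np.array(A), np.array(rhs), rcond=None)
    Q = np.zeros((n, n))
    Q[iu] = q
    return Q + Q.T - np.diag(np.diag(Q))    # mirror the upper triangle

rng = np.random.default_rng(0)
F, P, K = 20, 30, 2                         # illustrative sizes
C = rng.standard_normal((F, K))
C[:K] = np.eye(K)                           # Eq. (21): first K frames are the bases
M = np.vstack([np.kron(C[f][None, :], random_rotation(rng)[:2]) for f in range(F)])
B = rng.standard_normal((3 * K, P))
U, s, Vt = np.linalg.svd(M @ B, full_matrices=False)
M_tilde = U[:, :3 * K] * np.sqrt(s[:3 * K])
G_true, *_ = np.linalg.lstsq(M_tilde, M, rcond=None)   # ground truth, for checking only

rel_errors = []
for k in range(K):
    g_k = G_true[:, 3 * k:3 * k + 3]
    Q_true = g_k @ g_k.T
    Q_est = solve_Qk(M_tilde, K, k)
    rel_errors.append(np.linalg.norm(Q_est - Q_true) / np.linalg.norm(Q_true))
print(rel_errors)
```

Parameterizing Q by its upper triangle keeps the system at $(9K^2 + 3K)/2$ unknowns and builds the symmetry of Q into the unknowns instead of adding it as extra equations.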



6 Performance Evaluation 

The performance of the closed-form solution is evaluated in a number of
experiments.

Fig. 1. A static cube and 3 points moving along straight lines. (a) Input image. (b)
Ground truth 3D shape. (c) Reconstruction by the closed-form solution. (d)
Reconstruction by the method in [6]. (e) Reconstruction by the method in [4] after 4000
iterations. (f) Reconstruction by the tri-linear method [13] after 4000 iterations.



6.1 Comparison with Three Previous Methods 

We first compare the solution with three related methods [6,4,13] in a simple
noiseless setting. Fig. 1 shows a scene consisting of a static cube and 3 moving
points. The measurement consists of 10 points: 7 visible vertices of the cube
and 3 moving points. The 3 points move along the axes at varying speed. This
setting consists of K = 2 shape bases, one for the static cube and another for
the moving points. Their image projections across 16 frames from different views
are given. One of them is shown in Fig. 1(a). The corresponding ground truth
structure is demonstrated in Fig. 1(b). Fig. 1(c) to (f) show the structures
reconstructed using the closed-form solution, the method in [6], the method in
[4], and the tri-linear method [13], respectively. While the closed-form solution
achieves the exact reconstruction with zero error, all three previous methods
result in apparent errors, even for such a simple noiseless setting. Fig. 2
demonstrates the reconstruction errors of the previous work on rotations, shapes, and
image measurements. The errors are computed relative to the ground truth.

6.2 Quantitative Evaluation on Synthetic Data 

Fig. 2. The relative errors on reconstruction of a static cube and 3 points moving along
straight lines. (Left) By the method in [6]. (Middle) By the method in [4] after 4000
iterations. (Right) By the trilinear method [13] after 4000 iterations. The range of the
error axis is [0%, 100%]. Note that our solution achieves zero reconstruction errors.

Our approach is then quantitatively evaluated on the synthetic data. We evaluate
the accuracy and robustness on three factors: deformation strength, number of
shape bases, and noise level. The deformation strength shows how close to rigid
the shape is. It is represented by the mean power ratio between each two bases,
i.e. $\mathrm{mean}_{i,j} \frac{\max(\|B_i\|, \|B_j\|)}{\min(\|B_i\|, \|B_j\|)}$. Larger ratio means weaker deformation, i.e. the
shape is closer to rigid. The number of shape bases represents the flexibility of the
shape. A bigger basis number means that the shape is more flexible. Assuming a
Gaussian white noise, we represent the noise strength level by the ratio between
the Frobenius norm of the noise and the measurement, i.e. $\frac{\|noise\|}{\|\tilde{W}\|}$. In general,
when noise exists, a weaker deformation leads to better performance, because
some deformation mode is more dominant and the noise relative to the dominant
basis is weaker; a bigger basis number results in poorer performance, because
the noise relative to each individual basis is stronger.
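The noise level defined above, a ratio of Frobenius norms, can be imposed on a measurement matrix as in the sketch below. It assumes NumPy, and the choice to rescale each noise sample so that the ratio is met exactly is an assumption made here for illustration; the text does not say how the noise was scaled. The function name `add_noise` and the placeholder matrix are hypothetical.

```python
import numpy as np

def add_noise(W, level, rng):
    """Return W plus Gaussian noise with ||noise||_F / ||W||_F equal to level."""
    noise = rng.standard_normal(W.shape)
    noise *= level * np.linalg.norm(W) / np.linalg.norm(noise)
    return W + noise

rng = np.random.default_rng(0)
W_demo = rng.standard_normal((36, 32))        # placeholder measurement matrix
for level in (0.0, 0.05, 0.10, 0.20):
    W_noisy = add_noise(W_demo, level, rng)
    achieved = np.linalg.norm(W_noisy - W_demo) / np.linalg.norm(W_demo)
    print(f"{level:.2f} -> {achieved:.4f}")
```

`np.linalg.norm` applied to a 2-D array without further arguments returns the Frobenius norm, which matches the definition used in the text.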

Fig. 3.(a) and (b) show the performance of our algorithm under various
deformation strength and noise levels on a two bases setting. The power ratios are
respectively $2^0, 2^1, \dots,$ and $2^8$. Four levels of Gaussian white noise are imposed.
Their strength levels are 0%, 5%, 10%, and 20% respectively. We test a number
of trials on each setting and compute the average reconstruction errors on the
rotations and 3D shapes, relative to the ground truth. Fig. 3.(c) and (d) show
the performance of our method under different numbers of shape bases and
noise levels. The basis number is 2, 3, ..., and 10 respectively. The bases have
equal powers and thus none of them is dominant. The same noise as in the last
experiment is imposed.

In both experiments, when the noise level is 0%, the closed-form solution 
always recovers the exact rotations and shapes with zero error. When there is 
noise, it achieves reasonable accuracy, e.g. the maximum reconstruction error 
is less than 15% when the noise level is 20%. As we expected, under the same 
noise level, the performance is better when the power ratio is larger and poorer 
when the basis number is bigger. Note that in all the experiments, the condition 
number of the linear system consisting of both basis constraints and rotation 
constraints has order of magnitude $O(10)$ to $O(10^2)$, even if the basis number is 
big and the deformation is strong. Our closed-form solution is thus numerically 
stable. 

Fig. 3. (a)&(b) Reconstruction errors on rotations and shapes under different levels of
noise and deformation strength. (c)&(d) Reconstruction errors on rotations and shapes
under different levels of noise and various basis numbers. Each curve respectively refers
to a noise level. The range of the error axis is [0%, 20%].

Fig. 4. Reconstruction of three moving objects in the static background. (a)&(d) Two
input images with marked features. (b)&(e) Reconstruction by the closed-form solution.
The yellow lines show the recovered trajectories from the beginning of the sequence
until the present frames. (c)&(f) Reconstruction by the method in [4]. The
yellow-circled area shows that the plane, which should be on top of the slope, is mistakenly
located underneath the slope.




A Closed-Form Solution to Non-rigid Shape and Motion Recovery 585 



6.3 Qualitative Evaluation on Real Video Sequences 

Finally we examine our approach qualitatively on a number of real video se- 
quences. One example is shown in Fig. 4. The sequence was taken of an indoor 
scene by a handheld camera. Three objects, a car, a plane, and a toy person, 
moved along fixed directions and at varying speeds. The rest of the scene was 
static. The car and the person moved on the floor and the plane moved along a 
slope. The scene structure was composed of two bases, one for the static objects 
and another for the moving objects. 32 feature points tracked across 18 images 
were used for reconstruction. Two of them are shown in Fig.4.(a) and (d). 

The rank of W was estimated in such a way that after rank reduction 
99% of the energy was kept. The basis number is automatically determined by 
K = rank(W)/3. The camera rotations and dynamic scene structure are then 
reconstructed. To evaluate the reconstruction, we synthesize the scene appear- 
ance viewed from one side, as shown in Fig.4.(b) and (e). The wireframes show 
the structure and the yellow lines show the trajectories of the moving objects 
from the beginning of the sequence until the present frames. The reconstruc- 
tion is consistent with our observation, e.g. the plane moved linearly on top of 
the slope. Fig.4.(c) and (f) show the reconstruction using the method in [4]. The 
shapes of the boxes are distorted and the plane is incorrectly located underneath 
the slope, as shown in the yellow circles. Note that occlusion was not taken into 
account when rendering these images, thus in the regions that should be oc- 
cluded, e.g. the area behind the slope, the stretched texture of the occluding 
objects appears. 
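The rank estimate described in this passage, keeping 99% of the energy after rank reduction, can be sketched as follows. Two points are assumptions made for illustration: "energy" is taken to be the sum of squared singular values, and the rank is rounded up to a multiple of 3 when deriving K. The function name and the demonstration matrix (sized like the 18-frame, 32-point sequence, with hand-picked singular values) are hypothetical.

```python
import numpy as np

def estimate_rank_and_bases(W_reg, energy=0.99):
    """Smallest rank retaining the given fraction of squared singular values."""
    s = np.linalg.svd(W_reg, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    rank = int(np.searchsorted(cumulative, energy)) + 1
    K = int(np.ceil(rank / 3))            # assumption: round up to whole bases
    return rank, K

# Demonstration matrix with known singular values: six dominant, two tiny.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((36, 8)))   # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((32, 8)))
s_demo = np.array([10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 0.01, 0.01])
W_demo = (U * s_demo) @ V.T

print(estimate_rank_and_bases(W_demo))
```

On this matrix the first five singular values hold about 93% of the energy, so the threshold is first reached at rank 6, giving K = 2.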

Human faces are highly non-rigid objects and 3D face shapes can be rep- 
resented as weighted combinations of certain shape bases that refer to various 
facial expressions. They thus can be reconstructed by our approach. One example 
is shown in Fig. 5. The sequence consists of 236 images that contain expressions 
like eye blinking and mouth opening. 60 feature points were tracked using an 
efficient Active Appearance Model (AAM) method [1]. Fig. 5. (a) and (d) display 
two input images with marked features. Their corresponding shapes are recon- 
structed and shown from novel views in Fig.5.(b) and (e). Their corresponding 
3D wireframe models shown in Fig.5.(c) and (f) demonstrate the recovered fa- 
cial deformations such as mouth opening and eye closure. Note that the feature 
correspondence in these experiments was noisy, especially for those features on 
the sides of face. The reconstruction performance of our approach demonstrates 
its robustness to the image noise. 

Fig. 5. Reconstruction of face shapes with expressions. (a)&(d) Input images. (b)&(e)
Reconstructed face shapes seen from novel views. (c)&(f) The wireframe models
demonstrate the recovered facial deformations such as mouth opening and eye closure.

7 Conclusion and Discussion 

This paper proposes a closed-form solution to the problem of non-rigid shape
and motion recovery from single-camera video using the least square and
factorization methods. In particular, we have proven that enforcing only the rotation
constraints results in ambiguous and invalid solutions. We thus introduce the
basis constraints to remove this ambiguity. We have also proven that imposing
both metric constraints leads to a unique reconstruction of the non-rigid shape
and motion. The performance of our algorithm is demonstrated by experiments
on both simulated data and real video data. Our algorithm has also been
successfully applied to separate the local deformations from the global rotations
and translations in the 3D motion capture data [7].

Currently, our approach does not consider the degenerate deformation modes 
of 3D shapes. A deformation mode is degenerate, if it limits the shape to deform 
in a plane, i.e., the rank of the corresponding basis is less than 3. For
example, if a scene contains only one moving object that moves along a straight line, 
the deformation mode referring to the linear motion is degenerate, because the 
corresponding basis (the motion vector) is of rank 1. It is conceivable that the 
ambiguity cannot be completely eliminated by the basis constraints and enforc- 
ing both metric constraints is insufficient to produce a closed-form solution in 
such degenerate cases. We are now exploring how to extend the current approach 
to recovering the non-rigid shapes that deform with degenerate modes. Another 
limitation of our approach is that we assume the weak perspective projection 
model. It would be interesting to see if the proposed approach could be extended 
to the full perspective projection model. 

Abstract. We address the fundamental problem of matching two static 
images. Significant progress has been made in this area, but the cor- 
respondence problem has not been solved. Most of the remaining diffi- 
culties are caused by occlusion and lack of texture. We propose an ap- 
proach that addresses these difficulties within a perceptual organization 
framework, taking into account both binocular and monocular sources 
of information. Geometric and color information from the scene is used 
for grouping, complementing each other’s strengths. We begin by gener- 
ating matching hypotheses for every pixel in such a way that a variety 
of matching techniques can be integrated, thus allowing us to combine 
their particular advantages. Correct matches are detected based on the 
support they receive from their neighboring candidate matches in 3-D, 
after tensor voting. They are grouped into smooth surfaces, the projec- 
tions of which on the images serve as the reliable set of matches. The 
use of segmentation based on geometric cues to infer the color distribu- 
tions of scene surfaces is arguably the most significant contribution of 
our research. The inferred reliable set of matches guides the generation 
of disparity hypotheses for the unmatched pixels. The match for an un- 
matched pixel is selected among a set of candidates as the one that is a 
good continuation of the surface, and also compatible with the observed 
color distribution of the surface in both images. Thus, information is 
propagated from more to less reliable pixels considering both geometric 
and color information. We present results on standard stereo pairs. 



1 Introduction 

The premise of shape from stereo comes from the fact that, in a set of two 
or more images of a static scene, world points appear on the images at differ- 
ent disparities depending on their distance from the cameras. Establishing pixel 
correspondences on real images, though, is far from trivial. Projective and photo- 
metric distortion, sensor noise, occlusion, lack of texture, and repetitive patterns 
make matching the most difficult stage of a stereo algorithm. To address mainly 
occlusion and lack of texture, we propose a stereo algorithm that operates as 
a perceptual organization process in the 3-D disparity space knowing that false 
matches will most likely occur in textureless areas and close to depth discontinu- 
ities. Since binocular processing has limitations in these areas, we use monocular 
information to overcome them. We start by detecting the most reliable matches, 
which are grouped into layers. Shape and color information from the layers is 
used to infer matches for the remaining pixels. 

The paper is organized as follows: Section 2 reviews related work; Section 3 
is an overview of the algorithm; Section 4 describes the initial matching stage; 
Section 5 the detection of correct matches using tensor voting; Section 6 the 
segmentation process; Section 7 the disparity computation for unmatched pixels; 
Section 8 contains experimental results; and Section 9 concludes the paper. 



2 Related Work 

Published research on stereo with explicit treatment of occlusion includes numer- 
ous approaches (see [1] for a comprehensive review of stereo algorithms). They 
can be categorized into the following categories: local, global and approaches with 
extended local support, such as the one we propose. Local methods attempt to 
solve the correspondence problem using local operators in relatively small win- 
dows. Kanade and Okutomi [2] use matching windows whose size and shape 
adapt according to the intensities and disparities that are included in them. In 
[3] Veksler presents a method that takes into account the average matching error 
per pixel, the variance of this error and the size of the window. 

On the other hand, global methods arrive at disparity assignments by opti- 
mizing a global cost function that usually includes penalties for pixel dissimi- 
larities and violation of the smoothness constraint. The latter introduces a bias 
for constant disparities at neighboring pixels, thus favoring frontoparallel planes. 
Global stereo methods that explicitly model occlusion include [4] [5] [6] [7] where 
optimization is performed using dynamic programming. The drawback of dy- 
namic programming is that each epipolar line is processed independently, which 
results in “streaking” artifacts in the output. Consistency among epipolar lines 
is ensured by using graph cuts to optimize the objective function. Ishikawa and 
Geiger [8] explicitly model occlusion in a graph cut framework, but their algo- 
rithm is limited to convex energy functions which do not perform well at discon- 
tinuities. Kolmogorov and Zabih [9] advance the graph cut matching framework 
by proposing an optimization technique that is applicable to more general ob- 
jective functions and obtains very good results. 

Between these two extremes are approaches that are neither “winner-take- 
all” at the local level, nor global. They start from the most reliable matches 
to estimate the disparities of less reliable ones. Many authors [10] [11] use the 
support and inhibition mechanism of cooperative stereo to ensure the propaga- 
tion of correct disparities and the uniqueness of matches with respect to both 
images. Reliable matches without competitors are used to reinforce matches 
that are compatible with them and eliminate the ones that contradict them, 
progressively disambiguating more pixels. Zhang and Kambhamettu [12] extend 
the cooperative framework from single pixels to segmented surfaces, in the form 
of small locally planar patches. A different method of aggregating support is 
nonlinear diffusion, proposed by Scharstein and Szeliski in [13], where disparity 
estimates are propagated to neighboring pixels until convergence. Sun et al. [14] 
formulate the problem as an MRF with explicit handling of occlusions. In the be- 
lief propagation framework, information is passed to adjacent pixels in the form 
of messages whose weight also takes into account image segmentation. Other 
progressive approaches include Szeliski and Scharstein [15] and Zhang and Shan 
[16] who start from the most reliable matches and allow the most certain
disparities to guide the estimation of less certain ones, while occlusions are explicitly 
labeled. 

The final class of methods reviewed here are based on image segmentation. 
Birchfield and Tomasi [17] cast the problem of correspondence as image segmen- 
tation followed by the estimation of an affine transformation for each segment 
between the images. Tao et al. [18] introduce a stereo matching technique where 
the goal is to establish correspondence between image regions rather than pix- 
els. Both these methods are limited to planar surfaces, unlike the one of [12] 
which was described above. Lin and Tomasi [19] propose a framework where 3-D 
shape is estimated by fitting splines, while 2-D support is based on image seg- 
mentation. Processing alternates between these two steps until convergence. As 
mentioned above, in [14] image segmentation is a soft constraint, since messages 
can be passed between different image segments with a lower weight. All of these 
approaches, however, address color segmentation independently of disparity. 

The perceptual organization stage of the approach we propose here is based 
on the work of Lee et al. [20], which was later extended to multiple views in [21]. 
However, there are significant differences in the way initial matches are generated 
and, most importantly, in the integration of monocular cues to specifically ad- 
dress occlusion and lack of texture. The approach in [20] has a less sophisticated 
initial matching scheme, the failures of which cannot always be corrected. In ad- 
dition, the post-processing mechanism based on edge detection it proposes is not 
as effective against occlusion as the approach presented here. On the other hand, 
information propagation in 3-D and the use of surface saliency as the criterion 
for the selection of pixel correspondences remain cornerstones of our approach. 

3 Algorithm Overview 

The proposed algorithm has four steps, which are illustrated in Fig. 1, for the 
“Sawtooth” stereo pair (courtesy of [1]). 

— The input to the first stage is a pair of images which we assume have been 
rectified so that conjugate epipolar lines are parallel and share the same y 
coordinate. The goal is the generation of matching hypotheses for every pixel 
and it is accomplished with three different matching techniques. The output 
is a set of points in 3-D disparity space (Fig. 1(b)). 

— Next is the tensor voting stage, during which the unorganized point cloud 
from the previous stage is encoded in the form of second order symmetric 
tensors which cast votes to their neighbors. Salient matches can be detected
based on the amount of support they receive from their neighbors. Uniqueness
is also enforced at the end of this stage with respect to surface saliency
and not a local measure, such as cross-correlation, which is more susceptible
to noise. The output, which we term “sparse disparity map”, consists of at
most one match for each pixel of the reference image, which has an associated
surface saliency value and an estimate of surface orientation. It can be
seen in Fig. 1(c). This part of the algorithm is based on our previous work,
published in [20].

Fig. 1. Overview of the processing steps for the “Sawtooth” dataset: (a) left image,
(b) initial matches, (c) sparse disparities, (d) layer labels, (e) final disparities,
(f) error map. The initial matches have been rotated so that the multiple candidates
for each pixel are visible. Black pixels in the error map indicate errors greater than
1 disparity level, gray pixels correspond to errors between 0.5 and 1 disparity level,
while white pixels are correct (or occluded and thus ignored)

— The outputs of the tensor voting are grouped, using the estimated surface 
orientations, into smooth layers. These are refined by removing those 3-D 
points that correspond to pixels that are inconsistent with the layer’s color 
distribution. This addresses the usual problem of surface over-extension that 
occurs near occlusions. The over-extensions are usually not color-consistent 
and are removed at this stage. Thus we derive the set of reliable matches. 
Please note that the term layer throughout this paper is used interchangeably 
with surface, since by layer we mean a smooth, but not necessarily planar, 
surface in 3-D disparity space (x,y,d), where d denotes disparity. The label 
of each pixel can be seen in Fig. 1(d). 

— The last module starts from a set of segmented surfaces and computes dis- 
parities for unmatched pixels. Disparity candidates are generated from the 
nearby layers, to which the pixel may belong based on its color. These are 
also validated in the right image and the final disparity is selected as the one
that is a smooth continuation of the most likely layer. The output of this 
stage is a dense disparity map with one disparity estimate for every pixel of 
the reference image including the occluded ones (Fig. 1(e)). Disparity esti- 
mation for occluded pixels is possible since the surfaces can be extrapolated 
using tensor voting even if they are occluded. 

The algorithm is applied on the four datasets proposed in [1] and the two
proposed in [22], which are also available online at

http://www.middlebury.edu/stereo.

Quantitative results are presented in Section 8. 



4 Initial Matching 

A large number of matching techniques have been proposed in the literature [1]. 
We propose a scheme for combining heterogeneous matching techniques, thus 
taking advantage of their combined strengths. For the results presented in this 
paper, three matching techniques are used, but any kind of matching can be 
integrated in the framework. The techniques used here are: 

— A 5 × 5 normalized cross correlation window, which is small enough to capture
details and only assumes constant disparity for small parts of the image. 

— A 35 × 35 normalized cross correlation window, which is applied only at pixels
where the standard deviation of the three color channels is less than 20. The 
use of such a big window over the entire image would be catastrophic, but it 
is effective when applied only in virtually textureless regions, where smaller 
windows completely fail to detect correct matches. 

— A 7 × 7 symmetric interval matching window with truncated cost function as
in [15]. The images are linearly interpolated along the x-axis so that samples
exist in half-pixel intervals. The cost for matching pixel $(x_L, y)$ in the left
image with pixel $(x_R, y)$ in the right image is:

$$C(x_L, x_R, y) = \sum_{c} \min \left\{ \mathrm{dist}\big(I_{Lc}(x_i, y), I_{Rc}(x_j, y)\big) : x_i \in \left[x_L - \tfrac{1}{2},\, x_L + \tfrac{1}{2}\right],\; x_j \in \left[x_R - \tfrac{1}{2},\, x_R + \tfrac{1}{2}\right] \right\} \qquad (1)$$

The summation is over the three RGB color channels and dist() is the Euclidean
distance between the value of a color channel $I_{Lc}$ in the left image
and $I_{Rc}$ in the right image. If the distance for any channel exceeds a preset
truncation parameter trunc, the total cost is set to 3 × trunc. This technique
is effective near discontinuities due to the robustness of the cost function to
pixels from different surfaces. Typical values for trunc are between 3 and 10.
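
As an illustration of the truncated interval cost of Eq. 1, the following is a minimal sketch for a single candidate match on one scanline. It assumes that the minimum is taken over the three half-pixel samples around each pixel and it omits the aggregation over the 7 × 7 window; the function name and the default value of trunc are hypothetical choices, not part of the method description.

```python
import numpy as np

def interval_matching_cost(left_row, right_row, x_left, x_right, trunc=5.0):
    """Truncated symmetric interval cost for one candidate match on a scanline.

    left_row, right_row: arrays of shape (width, 3), one RGB scanline each.
    x_left, x_right: integer positions of interior pixels (not on the border).
    """
    def half_pixel_samples(row, x):
        # Linear interpolation gives samples at x - 1/2, x and x + 1/2.
        return np.stack([0.5 * (row[x - 1] + row[x]),
                         row[x],
                         0.5 * (row[x] + row[x + 1])])

    samples_left = half_pixel_samples(np.asarray(left_row, dtype=float), x_left)
    samples_right = half_pixel_samples(np.asarray(right_row, dtype=float), x_right)

    # Axes: (left sample, right sample, color channel).
    diffs = np.abs(samples_left[:, None, :] - samples_right[None, :, :])
    per_channel = diffs.min(axis=(0, 1))  # smallest distance within the interval

    # If any channel exceeds the truncation parameter, the cost saturates.
    if np.any(per_channel > trunc):
        return 3.0 * trunc
    return float(per_channel.sum())
```

In a full implementation this per-pixel cost would be summed over the matching window for each disparity hypothesis.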

Each matching technique is repeated using the right image as reference and 
the left as target. This increases the true positive rate especially near discon- 
tinuities, where the presence of occluded pixels in the reference window affects 
the results of matching. When the other image is used as reference, these pixels
do not appear in the reference window. 

The maximum matching score, or the minimum cost, for every pixel is re- 
tained as a matching hypothesis. Matching scores and costs are then discarded 
and each hypothesis is treated equally in the following stage. A simple parabolic 
fit [1] is used for subpixel accuracy, mainly because it makes continuous slanted 
or curved surfaces appear continuous and not staircase-like. Computational com- 
plexity is not affected since the number of matching hypotheses is unchanged. 
Besides the increased number of correct detections, the combination of these 
matching techniques offers the advantage that the failures of a particular technique
are not detrimental to the success of the algorithm. The 35 × 35 window
is typically applied to very small uniform parts of the image and never near
discontinuities, where color exhibits some variance. Our experiments have also
shown that the errors produced by small windows, such as the 5 × 5 and 7 × 7
used here, are randomly spread in space and do not usually align to form non- 
existent structures. This property is important for our methodology that is based 
on the perceptual organization, due to “non-accidental alignment”, of candidate 
matches in space. 
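
The parabolic fit used for subpixel accuracy can be sketched as follows. This is the standard three-point parabola interpolation, given here under the assumption that scores are available at the best integer disparity and its two neighbors; the function name is hypothetical.

```python
def parabolic_subpixel(score_prev, score_best, score_next):
    """Subpixel offset of the extremum of a parabola through three samples.

    The samples are the matching scores (or costs) at disparities d - 1, d
    and d + 1, where d is the best integer disparity. The refined disparity
    is d plus the returned offset.
    """
    denom = score_prev - 2.0 * score_best + score_next
    if denom == 0.0:
        # The three samples are collinear, so no refinement is possible.
        return 0.0
    return 0.5 * (score_prev - score_next) / denom
```

Since only the retained hypothesis of each pixel is refined, the number of hypotheses passed to the next stage stays the same.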

5 Detection of Correct Matches 

This section describes how correct matches can be found among the hypotheses of 
the previous stage by examining how they can be grouped with their neighboring 
candidate matches to form smooth 3-D surfaces. This is accomplished by tensor 
voting, which also allows us to infer the orientation of these surfaces. 

5.1 Overview of Tensor Voting 

The use of a voting process for structure inference from sparse and noisy data was 
presented in [23]. The methodology is non-iterative and robust to considerable 
amounts of outlier noise. It has one free parameter: the scale of voting, which 
essentially defines the size of the neighborhood of each point. The input data is 
encoded as second-order symmetric tensors, and constraints, such as proximity, 
co-linearity and co-curvilinearity are propagated by voting within the neighbor- 
hood. The tensors allow the representation of points on smooth surfaces, surface 
intersections, curves and junctions, without having to keep each type in separate 
spaces. In 3-D, a second-order tensor has the form of an ellipsoid, or equivalently 
of a 3 × 3 matrix. Its shape encodes the type of feature that it represents, while
its size the saliency or the confidence we have in this information (Fig. 2(a)). 

The tensors are initialized as identity matrices, since no information about
their preferred orientation is known. During the voting process, each input site 
casts votes to its neighboring input sites that contain tokens. The votes are also 
second-order symmetric tensors. Their shape corresponds to the orientation the 
receiver would have, if the voter and receiver were in the same structure. The 
saliency (strength) of a vote cast by a unit stick tensor decays with respect
to the length of the smooth circular path connecting the voter and receiver,
according to the following equation:

$$S(s, \kappa, \sigma) = e^{-\frac{s^2 + c\kappa^2}{\sigma^2}} \qquad (2)$$

where $s$ is the length of the arc between the voter and receiver, $\kappa$ is its
curvature (see Fig. 2(b)), $\sigma$ is the scale of voting, and $c$ is a constant. The votes
cast by un-oriented voters can be derived from the above equation, but this is
beyond the scope of this paper. Vote accumulation is performed by tensor addition,
which is equivalent to the addition of 3 × 3 matrices. After voting is completed,
the eigensystem of each tensor is analyzed and the tensor is decomposed as in:

$$T = \lambda_1 e_1 e_1^T + \lambda_2 e_2 e_2^T + \lambda_3 e_3 e_3^T = (\lambda_1 - \lambda_2)\, e_1 e_1^T + (\lambda_2 - \lambda_3)(e_1 e_1^T + e_2 e_2^T) + \lambda_3 (e_1 e_1^T + e_2 e_2^T + e_3 e_3^T) \qquad (3)$$

where $\lambda_i$ are the eigenvalues in decreasing order and $e_i$ are the corresponding
eigenvectors. The likelihood that a point belongs to a smooth perceptual structure
is determined as follows. The difference between the two largest eigenvalues
encodes surface saliency, with a surface normal given by $e_1$. The difference
between the second and third eigenvalue encodes curve saliency, with a curve
tangent parallel to $e_3$. Finally, the smallest eigenvalue encodes junction saliency.
If surface saliency is high, the point most likely belongs to a surface and $e_1$ is its
normal. Outliers that receive no or inconsistent support from their neighborhood
can be identified by their low saliency and the lack of a dominant orientation. In
the case of stereo, we assume that all inliers lie on surfaces that reflect light
towards the cameras, and therefore we do not consider curves and junctions.
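
The interpretation of the accumulated tensor can be illustrated with a short sketch. It assumes the tensor is available as a symmetric 3 × 3 NumPy array; the function name and the dictionary keys are hypothetical, and the sign of the returned eigenvectors is arbitrary.

```python
import numpy as np

def tensor_saliencies(tensor):
    """Decompose a symmetric 3x3 tensor into saliencies and orientations."""
    # eigh returns eigenvalues in ascending order, so reverse to get
    # lambda_1 >= lambda_2 >= lambda_3 and reorder the eigenvectors to match.
    values, vectors = np.linalg.eigh(np.asarray(tensor, dtype=float))
    lam = values[::-1]
    e = vectors[:, ::-1]
    return {
        "surface": lam[0] - lam[1],   # surface saliency
        "curve": lam[1] - lam[2],     # curve saliency
        "junction": lam[2],           # junction saliency
        "normal": e[:, 0],            # e_1, surface normal estimate
        "tangent": e[:, 2],           # e_3, curve tangent estimate
    }
```

For the stereo problem described here only the surface saliency and the normal would be used.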





Fig. 2. Tensor voting. (a) The shape of the tensor indicates if there is a preferred
orientation, while its size indicates the confidence of this information. The top tensor has a
strong preference of orientation and is more salient than the bottom tensor, which is
smaller and un-oriented. (b) Vote generation as a function of the distance and curvature
of the arc and the orientation of the voter. (c) Voting in 3-D neighborhoods eliminates
interference between adjacent pixels from different layers
5.2 Detection of Matches as Surface Inliers

The goal of this stage is to address stereo as a perceptual organization problem 
in 3-D, based on the premise that the correct matches should form coherent 
surfaces in the 3-D disparity space. This is the only part of our approach that 
is based on [20]. The input is a cloud of points in a 3-D space (x, y, zscale × d),
where zscale is a constant used to make the input less flat with respect to the
d-axis, since disparity space is usually a lot flatter than actual (x, y, z). Its
typical value is 8 and the sensitivity is extremely low for a reasonable range such
as 4 to 20. The quantitative matching scores are disregarded and all candidate
matches are initialized as un-oriented tensors with saliency (confidence) 1. If two
or more matches fall within the same (x, y, zscale × d) voxel, their initial saliencies
are added, thus increasing the confidence of candidate matches confirmed by 
multiple matching techniques. 
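
The initialization of the voting input can be sketched as below. The sketch assumes unit-sized voxels in the scaled space and rounding to the nearest voxel, neither of which is specified in the text; the function name is hypothetical.

```python
from collections import defaultdict

def accumulate_candidates(candidates, zscale=8.0):
    """Merge candidate matches into voxels of the scaled disparity space.

    candidates: iterable of (x, y, d) triples produced by all matching
    techniques. Returns a dict mapping a voxel (x, y, round(zscale * d))
    to its initial saliency, which is the number of candidates inside it.
    """
    saliency = defaultdict(float)
    for x, y, d in candidates:
        voxel = (int(x), int(y), int(round(zscale * d)))
        saliency[voxel] += 1.0  # each candidate starts with saliency 1
    return dict(saliency)
```

A candidate confirmed by two matching techniques therefore enters the voting stage with saliency 2.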

After the inputs have been encoded as tensors, they cast votes to their neigh- 
bors. The voting neighborhood includes all locations at which the strength of the 
votes is at least 2.5% of the voter’s saliency. Therefore, its size is a function of $\sigma$
from Eq. 2. What should be pointed out here is the fact that since information 
propagation is performed in 3-D there is very little interference between candi- 
date matches for pixels that are adjacent in the image but come from different 
surfaces (see Fig. 2(c)). This is a big advantage over information propagation 
between adjacent pixels, even if it is mitigated by some dissimilarity measure. 
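
The dependence of the neighborhood size on the scale can be made concrete with a small sketch. It assumes that the decay of Eq. 2 has the form exp(-(s^2 + c κ^2) / σ^2), which is the usual form in the tensor voting literature, and that the cutoff is evaluated along a straight path (κ = 0); the value of c and both function names are placeholders.

```python
import math

def vote_strength(s, kappa, sigma, c=1.0):
    """Assumed decay of a stick vote with arc length s and curvature kappa."""
    return math.exp(-(s ** 2 + c * kappa ** 2) / sigma ** 2)

def voting_radius(sigma, cutoff=0.025):
    """Largest straight-line distance at which the vote strength is still
    at least `cutoff` times the voter's saliency."""
    # Solve exp(-s^2 / sigma^2) = cutoff for s.
    return sigma * math.sqrt(-math.log(cutoff))
```

Under these assumptions, a scale of σ^2 = 100 gives a radius of roughly 19.2, so the neighborhood grows linearly with σ.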

Once voting is completed, the results can be analyzed and the surface saliency 
of every candidate match can be computed as in Eq. 3. Uniqueness is enforced 
with respect to the left image by retaining the candidate with the highest surface 
saliency for every pixel. We do not enforce uniqueness with respect to the right 
image since it is violated by slanted surfaces which project to a different number 
of pixels on each image. Since the objective is disparity estimation for every 
pixel in the reference image, uniqueness applies to that image only. The fact 
that a candidate match has no competition for a given pixel does not necessarily 
indicate that it is correct, since the correct match could have been missed at the 
first stage. Therefore, candidate matches with low surface saliency are rejected 
even if they satisfy uniqueness. Surface saliency is a more reliable criterion for 
the selection of correct matches than the score of a local matching operator, 
because it requires that candidate matches, identified as such by local operators, 
should also form coherent surfaces in 3-D. This scheme is capable of rejecting 
false positive responses of the local operators, which is not possible at the local 
level. Based on the datasets we use, good results are achieved when the least 
salient candidates are gradually rejected until disparity estimates remain for 
about 70-80% of the pixels. The resulting set, which we call the “sparse disparity
map”, retains the matches that have high surface saliency and also satisfy uniqueness.
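
The selection procedure of this stage can be sketched as follows. The gradual rejection is simplified here to a single cut at a fixed fraction of the pixels, and the function name, argument names and default fraction are hypothetical.

```python
def sparse_disparity_map(candidates, keep_fraction=0.75, num_pixels=None):
    """Enforce uniqueness per reference pixel, then drop the least salient.

    candidates: list of (x, y, d, surface_saliency) tuples.
    Returns a dict mapping (x, y) to (d, surface_saliency).
    """
    # Uniqueness: keep the candidate with the highest surface saliency.
    best = {}
    for x, y, d, saliency in candidates:
        if (x, y) not in best or saliency > best[(x, y)][1]:
            best[(x, y)] = (d, saliency)

    # Rejection: retain only the most salient matches, up to the target density.
    total = num_pixels if num_pixels is not None else len(best)
    n_keep = min(len(best), int(keep_fraction * total))
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    return dict(ranked[:n_keep])
```

Note that a pixel with a single candidate is still rejected if that candidate ranks among the least salient ones.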

6 Segmentation into Layers 

Surface inliers are segmented into layers using a simple growing scheme. By 
layers we mean surfaces with smooth variation of surface normal. Therefore, the 
layers do not have to be planar and the points that belong to them do not have
to form one connected component. Labeling starts from seed matches that have 
maximum surface saliency by examining matches within a certain distance in 
3-D for compatibility in terms of surface normals as in Fig. 3(a). If a smooth 
surface that goes through the seed and the match under consideration exists, 
then the point is added to the layer. Further comparisons for the addition of 
more points to a layer are made between unlabeled points and the points from 
the layer that are closer to them. For all the experiments presented in this paper 
the grouping criteria are: $\cos(\theta_1) < 0.95$ and $\max(\cos(\theta_2), \cos(\theta_3)) < 0.08$.
The search region, which is a non-critical parameter, is set equal to the voting 
neighborhood size. Since we do not attempt to fit global surface models, our 
grouping scheme performs equally well when the scene surfaces deviate from 
planar or quadric models. 

To derive the reliable set of matches, one additional step is necessary to 
remove possible contamination from the layers due to surface over-extension 
from the initial matching stage. The colors of all points assigned to a layer are 
examined for consistency with the layer’s local color distribution and the outliers 
are removed from the layer. Color consistency of a pixel is checked by computing 
the ratio of pixels of the same layer with similar color to the current pixel over 
the total number of pixels of the layer within the neighborhood. This is repeated 
for every layer on both images and if the current assignment does not correspond 
to the maximum ratio in both images, then the pixel is removed from the layer. 
The color similarity ratio for pixel $(x_0, y_0)$ in the left image with layer $i$ can be
computed according to the following equation:

$$R_i(x_0, y_0) = \frac{\sum_{(x,y) \in N} T\big(lab(x,y) = i \;\mathrm{AND}\; dist(I_L(x,y), I_L(x_0,y_0)) < C_{thr}\big)}{\sum_{(x,y) \in N} T\big(lab(x,y) = i\big)} \qquad (4)$$

where T() is a test function that is 1 if its argument is true, lab() is the label of a
pixel and $C_{thr}$ is a color distance threshold in RGB space, typically 10. The same
is applied for the right image for pixel $(x_0 - d_0, y_0)$. Rejected pixels are not added
to the layer with the maximum color similarity since they are not geometrically 
consistent with that layer. Layers with a very small number of points, such as 
0.5% of the number of pixels, are also rejected. This addresses the usual problem 
of surface over-extension that occurs near occlusions, since occluded pixels can 
be erroneously assigned the disparity of the foreground, due to the absence of 
a visible correspondence in the other image. The over-extensions, however, are 
usually not color-consistent and are removed at this stage. 
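
A minimal sketch of the color similarity ratio of Eq. 4 for one image is given below. The square neighborhood and its radius are assumptions, since the neighborhood size is not stated in this section, and the function and argument names are hypothetical.

```python
import numpy as np

def color_similarity_ratio(image, labels, x0, y0, layer, radius=5, c_thr=10.0):
    """Fraction of neighboring pixels of `layer` whose color is close to (x0, y0).

    image: array of shape (height, width, 3) with RGB values.
    labels: integer array of shape (height, width) with layer labels.
    """
    height, width = labels.shape
    rows = slice(max(0, y0 - radius), min(height, y0 + radius + 1))
    cols = slice(max(0, x0 - radius), min(width, x0 + radius + 1))

    patch = np.asarray(image[rows, cols], dtype=float)
    in_layer = labels[rows, cols] == layer

    # Euclidean RGB distance of every neighbor to the pixel under consideration.
    center = np.asarray(image[y0, x0], dtype=float)
    distance = np.linalg.norm(patch - center, axis=-1)

    count = in_layer.sum()
    if count == 0:
        return 0.0  # the layer is not present in the neighborhood
    return float((in_layer & (distance < c_thr)).sum() / count)
```

The consistency check described above would evaluate this ratio for every nearby layer in both images and keep a pixel only if its current layer attains the maximum in both.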

Our reliable set of matches is in the form of these layers which consist of 
matches that are unique with respect to the left image, have high surface saliency, 
and are both geometrically and photometrically consistent with their neighbors. 
Quantitative evaluation for the reliable sets of matches is presented in Table 1. 
The error metric used is the one proposed in [1], where matches are considered 
erroneous if they correspond to un-occluded image pixels and their disparity 
error is greater than one integer disparity level. Compared to similar results
published in [24], [25] and [26], our method outperforms [24] and [25] and is inferior to
[26] which, however, assumes constant disparity for the dense features it detects. 
Also, Szeliski and Scharstein [15] report an error rate for the reliable matches for 
the Tsukuba dataset of 2.1% for 45% density which rises to 4% for 73% density. 



Table 1. Quantitative evaluation of density and error rate for the Middlebury stereo
evaluation datasets

Method          Tsukuba            Sawtooth           Venus              Map
                error   density    error   density    error   density    error   density
Our results     1.18%   74.5%      0.27%   78.4%      0.20%   74.1%      0.08%   94.2%
Sara [24]       1.4%    45%        1.6%    52%        0.8%    40%        0.3%    74%
Veksler [25]    0.38%   66%        1.62%   76%        1.83%   68%        0.22%   87%
Veksler [26]    0.36%   75%        0.54%   87%        0.16%   73%        0.01%   87%



Fig. 3. (a) Surface compatibility test for surface segmentation. (b) Candidate generation
for unmatched pixels based on segmented layers. Note that only matches from the
appropriate layer vote at each candidate



7 Surface Growth 

The goal of this module is to generate candidate matches for the unmatched 
pixels. Given the already estimated disparities and labels for a large set of the 
pixels, there is more information available now that can enhance our ability to 
estimate the missing disparities. Color similarity ratios are computed for each 
unlabeled pixel (x, y) as in Eq. 4, for all layers within the neighborhood. All 
ratios are normalized by their sum and layers with high normalized ratios are 
considered as possible surfaces for the pixel under consideration. For each can- 
didate layer a range of potential disparities is estimated from pixels of the layer 
neighboring (x, y). The range is extended according to the disparity gradient 
limit constraint, which holds perfectly in the case of rectified parallel stereo 
pairs. These disparity hypotheses are verified on the target image by repeating 
the same process, unless they are occluded, in which case we allow occluded
surfaces to grow underneath the occluding ones. Votes are collected at valid
potential matches in disparity space, as before, with the only difference being that
only matches from the appropriate layer cast votes (see Fig. 3(b)). The most 
salient among the potential matches is selected and added to the layer, since it 
is the one that ensures the smoothest surface continuation. 
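
The choice of candidate layers for an unlabeled pixel can be sketched as follows. The text does not quantify what counts as a high normalized ratio, so the threshold `min_share` is a hypothetical parameter, as is the function name.

```python
def candidate_layers(ratios, min_share=0.3):
    """Select layers whose normalized color similarity ratio is high.

    ratios: dict mapping a layer label to its color similarity ratio for
    the pixel under consideration. Returns the sorted list of layer labels
    whose share of the total is at least `min_share`.
    """
    total = sum(ratios.values())
    if total == 0:
        return []  # the pixel resembles none of the nearby layers
    return sorted(layer for layer, ratio in ratios.items()
                  if ratio / total >= min_share)
```

Pixels for which this list is empty correspond to the unresolved cases discussed next.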

Finally, there are a few pixels that cannot be resolved because they exhibit 
low similarity to all layers, or because they are specular or in shadows. Candi- 
dates for these pixels are generated based on the disparities of all neighboring 
pixels and votes are collected at the candidate locations in disparity space. Again, 
the most salient ones are selected. We opted to use surface smoothness at this 
stage instead of image correlation, or other image based criteria, since we are 
dealing with pixels where the initial matching and color consistency failed to 
produce a consistent match. 

8 Experimental Results 

This section contains results on the color versions of the four datasets of [1] and 
the two proposed in [22]. The initial matching in all cases was done using the 
three matching techniques presented in Section 4. The scale of the voting field 
was $\sigma^2 = 100$ (except for Tsukuba, where it was 50), which corresponds to a voting
radius of 20, or a neighborhood of 41 × 41 × 41. Layer segmentation was done
using the thresholds of Section 6 and the color distance threshold Cthr was set to 
10. The error metric used is the one proposed in [1], where matches are consid- 
ered erroneous if they correspond to un-occluded image pixels and their disparity 
error is greater than one integer disparity level. Table 2 contains the error rates 
we achieved, as well as the rank our algorithm would achieve among the 27 algo- 
rithms in the evaluation. Due to lack of space we refer readers to the Middlebury 
College evaluation webpage (http://www.middlebury.edu/stereo) for results ob- 
tained by other methods. Based on the overall results for unoccluded pixels, our 
algorithm would rank first in the evaluation at the time of submission. 
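
For reference, the error metric described above can be sketched as follows, assuming the disparity maps and the mask of un-occluded pixels are NumPy arrays of the same shape; the function name is hypothetical.

```python
import numpy as np

def bad_pixel_rate(disparity, ground_truth, unoccluded_mask, threshold=1.0):
    """Fraction of un-occluded pixels whose disparity error exceeds threshold."""
    error = np.abs(np.asarray(disparity, dtype=float) -
                   np.asarray(ground_truth, dtype=float))
    mask = np.asarray(unoccluded_mask, dtype=bool)
    evaluated = mask.sum()
    if evaluated == 0:
        return 0.0
    # Occluded pixels are ignored both in the count of errors and in the total.
    return float(((error > threshold) & mask).sum() / evaluated)
```

The untextured and discontinuity columns of the evaluation use the same measure restricted to the corresponding pixel masks.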



Table 2. Quantitative evaluation for the original Middlebury stereo datasets

Dataset     Unoccluded        Untextured        Discontinuities
            error    rank     error    rank     error    rank
Tsukuba     2.19%    10       0.92%    5        11.93%   11
Sawtooth    0.53%    4        0%       1        4.91%    6
Venus       0.36%    1        0.16%    2        5.00%    4
Map         0.33%    9        -        -        4.69%    10



Table 3 reports results for the two datasets of [22] and results of three stereo 
algorithms, sum of squared differences (SSD), dynamic programming (DP) and 
graph cuts (GC) implemented by the authors of [22]. To our knowledge, our 
results are the best for these datasets. 

Table 3. Quantitative evaluation for the new Middlebury stereo datasets

Dataset    Our result    SSD      DP       GC
Cones      5.57%         17.8%    17.1%    12.6%
Teddy      9.10%         26.5%    30.1%    29.3%




Fig. 4. Left images, final disparity maps and error maps for the “Venus”, “Tsukuba”, 
“Cones” and “Teddy” datasets from the Middlebury Stereo evaluation 



9 Discussion 

We have presented a novel stereo algorithm that addresses the limitations of 
binocular matching by incorporating monocular information. We use tensor voting
to infer surface saliency and use it as a criterion for deciding on the correctness
of matches as in [20] and [21]. However, the quality of the experimental
results depends heavily on the inputs to the voting process, which are generated
by the new initial matching stage, and on the notion of geometric and photometric
consistency we have introduced for the layers. Careful initial matching and
the use of smoothness with respect to both surface orientation and color com- 
plement each other to derive more information from the stereo pair. Textured 
pixels are typically resolved by binocular matching, while untextured ones by the 
smooth extension of neighboring surfaces guided by color similarity. Arguably 
the most significant contribution is the segmentation into layers based on geo- 
metric properties and not appearance. We claim that this is advantageous over 
other methods that use color-based segmentation, since it utilizes the already 
computed disparities which are powerful cues that provide very reliable initial
estimates for the color distribution of layers. 

Other contributions include the initial matching stage that allows the in- 
tegration of any matching technique without any modification to subsequent 
modules. Information propagation in 3-D via tensor voting eliminates interfer- 
ence between adjacent pixels from different world surfaces. The proposed color 
similarity model works very well, despite its simplicity, because, locally, similar 
colors tend to belong to the same layer. The choice of a local non-parametric 
color representation allows us to handle surfaces with heterogeneous and vary- 
ing color distributions, such as the ones in the Venus dataset, on which image 
segmentation may be hard. An important contribution of this scheme is the 
elimination of over-extending occluding surfaces. Finally, the implicit assump- 
tion that scene surfaces are frontoparallel is only made in the initial matching 
stage, when all pixels in a small window are assumed to have the same dispar- 
ity. After this point, the surfaces are never assumed to be anything other than 
continuous. 

The algorithm is able to smoothly extend partially visible surfaces to infer the 
disparities of occluded pixels, but fails when entire surfaces are only monocularly 
visible, or when occluded surfaces abruptly change orientation. It also fails when 
objects are entirely missed and are not included in the set of reliable matches. 
Over- or under-segmentation is not catastrophic. For instance, a segmentation of
the Venus dataset into three instead of the correct four layers yields an error 
rate of 0.63%. 
Abstract. We consider the problem of estimating the 3D shape and
reflectance properties of an object made of a single material from a 
calibrated set of multiple views. To model reflectance, we propose a 
View Independent Reflectance Map (VIRM) and derive it from the Torrance-Sparrow
BRDF model. Reflectance estimation then amounts to estimating
VIRM parameters. We represent object shape using surface triangulation.
We pose the estimation problem as one of minimizing the cost of
matching the input images to the images synthesized using shape and reflectance
estimates. We show that by enforcing a constant value of VIRM
as a global constraint, we can minimize the matching cost function by
iterating between VIRM and shape estimation. Experimental results on
both synthetic and real objects show that our algorithm is effective in re- 
covering the 3D shape as well as non-lambertian reflectance information. 
Our algorithm does not require that light sources be known or calibrated 
using special objects, thus making it more flexible than other photomet- 
ric stereo or shape from shading methods. The estimated VIRM can be 
used to synthesize views of other objects. 



1 Introduction 

Many multiple-view algorithms have been proposed over the years for 3D recon- 
struction. These algorithms can be generally classified into image centered or 
object/scene centered. Image centered algorithms [1] first search for pixel cor- 
respondences followed by triangulation. Object/scene centered approaches are 
another category that has been explored recently [2,3]. A model of the object 
or scene is built and a consistency function is defined over the input images; 
maximizing the function achieves a 3D model that is most consistent with all 
the input views. In each approach, objects are frequently assumed to have Lam- 
bertian reflectance to facilitate finding correspondences. One exception is the 
radiance tensor field introduced by Jin et al. [3]. They propose a rank constraint
on the radiance tensor to recover the 3D shape. This is essentially a local reflectance
constraint to model both lambertian and non-lambertian objects. However, constructing
the radiance tensor requires that every scene point be seen by a substantial
number of cameras. In addition, the estimates obtained by most of these
algorithms are confined to individual pixels and they usually cannot recover fine
details of the shape, e.g., those encoded by shading.

acknowledged. Tianli Yu was supported in part by a Beckman Institute Graduate

Shape from shading algorithms, on the other hand, have the potential to 
recover greater details about surface shape, e.g., surface normal changes from 
image shading. However, shape from shading algorithms are usually developed 
for constrained environments, such as single material objects, lambertian reflectance,
single viewpoint, known or very simple light source, orthographic projection,
and absence of shadows and interreflections. Zhang et al. [4] present a
recent survey of shape from shading methods. Samaras et al. [5] propose to
incorporate shape from shading into multiple-view reconstruction. They
consider lambertian objects and recover piecewise constant albedo as well as surface
shape. In their method, specularities are detected and removed. Although
complex lighting can be well modeled locally using a single point light source
for lambertian objects, this is not the case for specular objects. Hertzmann and
Seitz [6] use a calibration sphere together with the object to obtain a reflectance 
map that can be used to recover the shape. Their approach works with a single 
view and can deal with multiple non-lambertian materials as well as unknown 
lighting; it however requires placement of calibration objects in the scene and 
change of lighting. 

The approach we present in this paper is object centered and extends the 
work on shape from shading to allow non-lambertian surface reflectance, un- 
controlled lighting, and the use of multiple views. We focus on single material 
objects, and assume that light sources are distant and there are no shadows 
or interreflection effects. Our approach does not require the knowledge of light 
sources or light calibration tools. In fact, the object itself serves as the cali- 
bration source. We show that by imposing a global lighting constraint, we can 
recover the 3D shape of the object, as well as a view-independent reflectance 
map (VIRM), which allows us to render, from any viewpoint, the same object or any
other object made of the same material under the same lighting.

This paper is organized as follows: Section 2 formulates the problem as a 
minimization problem. Section 3 derives the VIRM. Section 4 presents our es- 
timation algorithms. Experimental results on both synthetic and real data are 
given in Section 5. Section 6 presents conclusions and extensions. 



2 Problem Formulation 

Our objective is to reconstruct the 3D shape and reflectance information of an
object, given images of the object from different viewpoints, the intrinsic and
extrinsic camera parameters for each image, and the knowledge that the object is
made of a single material. This problem can be posed as minimizing the
differences between the input images and the images synthesized using the
underlying shape, lighting and BRDF model.

Suppose the surface of the object is $S(V)$, where $V$ is the parameter vector of
the object shape. The BRDF of the surface is denoted as
$\rho(\theta'_i, \phi'_i, \theta'_o, \phi'_o)$, where $(\theta'_i, \phi'_i)$ and
$(\theta'_o, \phi'_o)$ are the polar and azimuthal angles of the distant light
direction and the viewing direction in the local surface coordinates. Consider a
patch $P$ on $S$, small enough so that the surface normal remains nearly constant
over the entire patch. The brightness of the patch when viewed from a certain
direction $(\theta'_o, \phi'_o)$ can be computed by multiplying the BRDF with the
foreshortened lighting distribution $L(\theta'_i, \phi'_i)\cos\theta'_i$ and
integrating the product over the upper hemisphere of the patch, as in (1):

$$R(\theta'_o, \phi'_o) = \iint \rho(\theta'_i, \phi'_i, \theta'_o, \phi'_o)\, L(\theta'_i, \phi'_i) \cos\theta'_i \sin\theta'_i \, d\theta'_i \, d\phi'_i \tag{1}$$
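The hemisphere integral in (1) can be checked numerically. The following Python sketch is only an illustration and not part of the method itself: the function name `patch_brightness`, the midpoint-rule quadrature and the grid resolution are assumptions of this example. It evaluates (1) for a lambertian BRDF (albedo divided by pi) under uniform unit lighting, for which the integral should equal the albedo.

```python
import math

def patch_brightness(brdf, lighting, theta_o, phi_o, n_theta=200, n_phi=200):
    """Midpoint-rule estimate of the hemisphere integral in (1).

    brdf(theta_i, phi_i, theta_o, phi_o) and lighting(theta_i, phi_i) are
    hypothetical callables supplied by the caller.
    """
    d_theta = (math.pi / 2.0) / n_theta
    d_phi = (2.0 * math.pi) / n_phi
    total = 0.0
    for a in range(n_theta):
        theta_i = (a + 0.5) * d_theta
        # cos(theta_i) foreshortens the light, sin(theta_i) comes from the
        # solid-angle element sin(theta) dtheta dphi
        weight = math.cos(theta_i) * math.sin(theta_i) * d_theta * d_phi
        for b in range(n_phi):
            phi_i = (b + 0.5) * d_phi
            total += brdf(theta_i, phi_i, theta_o, phi_o) * lighting(theta_i, phi_i) * weight
    return total

albedo = 0.6
lambertian = lambda ti, pi_, to, po: albedo / math.pi   # constant BRDF
uniform_light = lambda ti, pi_: 1.0                      # unit radiance everywhere
brightness = patch_brightness(lambertian, uniform_light, 0.3, 0.0)
print(round(brightness, 4))
```

For a lambertian patch the result does not depend on the viewing direction, which is why the two viewing angles passed in are arbitrary.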




Fig. 1. Project a patch on the surface onto image planes 

Given the shape $S(V)$, BRDF model $\rho$ and lighting $L$, we can synthesize
the images of the object using (1) as follows. Let $\pi_j : \mathbb{R}^3 \to \mathbb{R}^2$ denote the
perspective projection that maps the 3D world coordinates onto the 2D image
plane corresponding to the $j$th view. For each $P$, let $O_j = \pi_j(P)$ be the projection
of $P$ onto the $j$th input image (Fig. 1). If $P$ is visible in the $j$th view, then we can compute
the intensity value of $O_j$ in the synthesized $j$th view using (1). Our goal is to
estimate the model parameters $V$, $\rho$ and $L$ that minimize the difference between
the input images and these synthesized images:



$$(V, \rho, L) = \arg\min F_{matching}(I_{syn}, I_{input}) \tag{2}$$



where Fmatching denotes the matching cost function between input images and 
synthesized images. It is defined as the sum of all intensity differences between 
the corresponding patches in the input and synthesized images: 



$$F_{matching}(I_{syn}, I_{input}) = \sum_{j} D(I^j_{syn}, I^j_{input}) = \sum_{\text{for all } P \text{ on } S} \; \sum_{P \text{ is visible in } j} d\left[O_j(I^j_{syn}), O_j(I^j_{input})\right] \tag{3}$$

where $D(I_{syn}, I_{input})$ is the difference between an entire synthesized image and
an entire input image, and $d(\cdot, \cdot)$ is the analogous difference between image patches.
$O_j(I)$ is the set of pixels covered by patch $O_j$ in image $I$. We will use the following
definition of $d$:

$$d\left[O_j(I^j_{syn}), O_j(I^j_{input})\right] = \{R_j(P) - \mathrm{mean}[O_j(I^j_{input})]\} \cdot n(O_j) \tag{4}$$

where $R_j(P)$ is the reflectance of $P$ in the $j$th image computed from (1), mean[·]
is the average pixel value in the patch, and $n(\cdot)$ is the number of pixels in the
patch.
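The accumulation of patch terms into the matching cost can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the names `patch_difference` and `matching_cost` are hypothetical, visibility is encoded by an empty pixel list, and the bracketed difference is squared so that the cost is non-negative, in line with the least-squares treatment in Section 4.

```python
def patch_difference(predicted, pixels):
    """Patch term in the spirit of (4): (R_j(P) - mean of the input pixels),
    weighted by the pixel count. Squaring is an assumption of this sketch."""
    mean_value = sum(pixels) / len(pixels)
    return (predicted - mean_value) ** 2 * len(pixels)

def matching_cost(observations):
    """Sum the patch terms over every (patch, view) pair in which the patch
    is visible. `observations` holds (predicted_reflectance, pixel_values)
    tuples; an empty pixel list stands for 'not visible in this view'."""
    return sum(patch_difference(r, px) for r, px in observations if px)

observations = [
    (0.50, [0.4, 0.6]),        # prediction equals the patch mean
    (0.30, [0.5, 0.5, 0.5]),   # off by 0.2 over three pixels
    (0.90, []),                # patch not visible, ignored
]
cost = matching_cost(observations)
```

Weighting by the pixel count makes large projected patches count proportionally more than small ones.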



3 View Independent Reflectance Map 



A reflectance map is used in shape from shading research to give the mapping
between surface normal and the brightness value viewed from a certain
direction. It avoids the separate estimation of the lighting and BRDF, yet contains
enough information to recover shape from shaded images. However, a reflectance
map is viewpoint dependent, which makes its use inconvenient for multiple-view
algorithms. Ramamoorthi and Hanrahan [8] point out that, given a shape, there
is an inherent ambiguity when one tries to fully recover the BRDF $\rho$ and lighting
$L$: a blurred light source and a sharp BRDF lead to the same results as a sharp
light source and a low-pass BRDF. We use this property to model the specular
light reflected by a BRDF as the same light passing through a circularly symmetric
low-pass filter and then reflected by a perfect mirror. Based on this idea, we
introduce the notion of the View-Independent Reflectance Map (VIRM), which we
use to represent the combined effects of lighting $L$ and BRDF $\rho$ independent of
the viewpoint. In this section we show that we can derive the VIRM by separating
the diffuse and specular parts of the reflectance.

As mentioned in Section 2, the brightness value of a surface point can be 
computed from (1). Specifically, we can use the Torrance-Sparrow microfacet 
model [7] as the BRDF model and simplify it to derive our VIRM. According to 
the model, the BRDF of a material can be written as:

$$\rho(\theta'_i, \phi'_i, \theta'_o, \phi'_o) = \rho(n, l, e) = K_d + K_s \frac{F(\mu, n, l, e)\, G(n, l, e)\, D(\sigma, n, l, e)}{4 (l \cdot n)(e \cdot n)} \tag{5}$$

where $n$, $l$ and $e$ are the surface normal, light direction and viewing direction vectors.
$F(\mu, n, l, e)$ is the Fresnel term, related to the material's index of refraction
$\mu$. $G(n, l, e)$ is the geometric attenuation term. $D(\sigma, n, l, e)$ is the microfacet
normal distribution function described below. The reflectance value when a patch
is illuminated by a directional source $L$ is given by

$$R(n, e, L, \rho) = |L| \cdot \left[ K_d (l \cdot n) + K_s \frac{F(\mu, n, l, e)\, G(n, l, e)\, D(\sigma, n, l, e)}{4 (e \cdot n)} \right] \tag{6}$$

where $L = |L| \cdot l$ is the light vector for the directional source.

For simplicity we assume $F$ and $G$ to be constant and absorb them into $K_s$.
Now let us consider the microfacet normal distribution function. A simple form
of $D$ is

$$D(\sigma, n, l, e) = \frac{1}{\pi\sigma^2} \exp\left[-\left(\frac{\theta_h}{\sigma}\right)^2\right], \qquad \cos\theta_h = n \cdot h \tag{7}$$

where $h$ is the mid-vector between $l$ and $e$ (Fig. 2) and $\sigma$ is the variance of the
microfacet normals. Let us take the mirror image of the viewing direction $e$ with
respect to the surface normal and denote it as the reflection vector $r$, as in
Fig. 2. If the light direction $l$ is coplanar with the surface normal $n$ and the viewing
direction $e$, we will have

$$\theta_{rl} = 2\theta_h \tag{8}$$

where $\theta_{rl}$ is the angle between the reflection vector $r$ and the light direction
vector $l$ (Fig. 2). Substituting (8) into (7), and denoting the result as $\tilde{D}$, we get:

$$\tilde{D}(\sigma, \theta_{rl}) = \frac{1}{\pi\sigma^2} \exp\left[-\left(\frac{\theta_{rl}}{2\sigma}\right)^2\right] \tag{9}$$

Fig. 2. Reflection vector r and mid-vector h
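The coplanar relation between the angle at the reflection vector and the angle at the mid-vector is easy to verify numerically. The Python sketch below is only a check on the geometry; the helper names and the particular angles (35 and -20 degrees from the normal) are arbitrary choices made for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return tuple(c / norm for c in v)

def angle(a, b):
    """Angle between two unit vectors, clamped against rounding error."""
    return math.acos(max(-1.0, min(1.0, dot(a, b))))

def in_plane(signed_angle):
    """Unit vector in the x-z plane at a signed angle from n = (0, 0, 1)."""
    return (math.sin(signed_angle), 0.0, math.cos(signed_angle))

n = (0.0, 0.0, 1.0)
e = in_plane(math.radians(35.0))    # viewing direction
l = in_plane(math.radians(-20.0))   # light direction, coplanar with n and e

r = tuple(2.0 * dot(n, e) * nc - ec for nc, ec in zip(n, e))   # mirror of e about n
h = normalize(tuple(lc + ec for lc, ec in zip(l, e)))           # mid-vector of l and e

theta_h = angle(n, h)
theta_rl = angle(r, l)
print(math.degrees(theta_h), math.degrees(theta_rl))
```

Moving `l` out of the x-z plane breaks the equality, which is the reason the symmetric lobe is only an approximation in the general case.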



Generally, $D$ is not symmetric around $r$. So strictly speaking, $\tilde{D} \neq D$ when
$l$ deviates from the plane determined by $e$ and $n$. However, Ramamoorthi and
Hanrahan [8] point out that when the viewing angle is small, assuming $D$ is symmetric
around $r$ is a good approximation. Under this assumption, we can use
$\tilde{D}(\sigma, \theta_{rl})$, which is a function of $\sigma$ and $\theta_{rl}$, to approximate $D$. Now the reflectance
value in (6) is

$$R(n, e, L, \rho) = |L| K_d (l \cdot n) + K_s \frac{|L|\, \tilde{D}(\sigma, \theta_{rl})}{n \cdot e} \tag{10}$$

In (10) the first term is the diffuse part, and the second term is the specular
part. If all the patches have the same material and the lighting is constant
with respect to the world coordinate system (e.g. all the surface patches are
illuminated under the same lighting), the diffuse term depends only on the surface
normal $n$, and the specular term depends only on $\theta_{rl}$ and the viewing angle $e \cdot n$.
Furthermore, in the specular term, we can merge $|L|$ and $\tilde{D}(\sigma, \theta_{rl})$ together and
view it as the result of filtering the single directional light source with a circularly
symmetric function $\tilde{D}$. Since the light source is fixed, the merged term depends
only on $r$ and we denote it as $R_s(r)$. Similarly, the first term on the right side
of (10) depends only on $n$ and is denoted as $R_d(n)$. So (10) becomes:

$$R(n, e, L, \rho) = R_d(n) + \frac{R_s(r)}{n \cdot e} \tag{11}$$

Meanwhile, since r is the mirror vector of e, the right side of equation (11) only 
depends on e and n. Equation (11) gives a very compact way to represent the 
reflectance of a surface patch under fixed lighting. It is just a linear combina- 
tion of two components, the diffuse part and the specular part, and each can be 
represented as a 2D function (since n and r are both unit vectors). The approx- 
imation is derived under single directional light source assumption, but it can 
be extended to the cases of multiple directional light sources since both distant 
illumination model and the circular symmetric filtering are linear operations. 

The simplified model in (11) implies that if we can estimate the diffuse and
specular distributions $R_d$ and $R_s$, we can compute the reflectance of any point
given its surface normal and viewing direction. We call $R_d$ and $R_s$ the diffuse
and specular components of the VIRM. They serve the same role as the reflectance
map in single-view shape from shading.
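Evaluating the two-component model is straightforward once the diffuse and specular components are available as functions. The Python sketch below is a minimal illustration, not the implementation used in the experiments: `virm_reflectance` and `make_single_light_virm` are hypothetical names, the components are built analytically for one directional light using the symmetric Gaussian lobe, and clamping the diffuse term at zero for back-facing light is an assumption of this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reflect(e, n):
    """Mirror image of the viewing direction e about the normal n."""
    return tuple(2.0 * dot(n, e) * nc - ec for nc, ec in zip(n, e))

def virm_reflectance(n, e, r_d, r_s):
    """Diffuse part R_d(n) plus specular part R_s(r) divided by (n . e)."""
    return r_d(n) + r_s(reflect(e, n)) / dot(n, e)

def make_single_light_virm(light_dir, intensity, k_d, k_s, sigma):
    """Analytic R_d and R_s for one directional light (illustrative only)."""
    def r_d(n):
        # clamp at zero for normals facing away from the light (assumption)
        return intensity * k_d * max(dot(light_dir, n), 0.0)
    def r_s(r):
        theta_rl = math.acos(max(-1.0, min(1.0, dot(r, light_dir))))
        lobe = math.exp(-(theta_rl / (2.0 * sigma)) ** 2) / (math.pi * sigma ** 2)
        return intensity * k_s * lobe
    return r_d, r_s

up = (0.0, 0.0, 1.0)
r_d, r_s = make_single_light_virm(up, intensity=1.0, k_d=0.5, k_s=0.1, sigma=0.2)
head_on = virm_reflectance(up, up, r_d, r_s)   # light, normal and view aligned
```

Because both components are linear in the lighting, the components for several directional lights can be obtained by summing the per-light functions.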

If we assume that all the surface patches have the same BRDF and the
lighting remains constant, then the VIRM is constant for all the patches and
viewing directions. This is equivalent to a global constraint over all the surface
patches and input views. By using the VIRM as our reflectance model, we can write
(2) as:

$$(V, R_d, R_s) = \arg\min F_{matching}(I_{syn}, I_{input}) \tag{12}$$

However, we should point out that when there are local variations of lighting,
such as those due to non-distant light sources, self-shadowing or inter-reflection,
the VIRM will not necessarily be constant. Our derivation of the VIRM assumes
that $F$ and $G$ in (6) are constant and that $D$ can be approximated by $\tilde{D}$; both
assumptions require the viewing angle to be away from $\pi/2$ to approximate the
Torrance-Sparrow model accurately.

4 Algorithm and Implementation 

In this section we present the various aspects of the algorithm we have used to
implement the approach described in Sections 2 and 3.

4.1 Data Representation 

We use a triangular mesh to represent the object surface, where each triangle 
serves the role of patch P in (4), and the 3D positions of all the vertices in 
the mesh are the shape parameters V. VIRM is represented by a collection 
of samples of the diffuse and specular distribution functions. We choose the 
longitude-latitude grid to sample the azimuthal and polar components at a fixed 
angular interval. Function values that are not on the sampling grid are computed 
using cubic interpolation. 
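The longitude-latitude sampling can be illustrated with a small Python sketch that maps a unit direction to its grid cell. The function names, the cell-centered layout and the choice of the z axis as the pole are assumptions of this example, and only the nearest-cell lookup is shown, not the cubic interpolation.

```python
import math

def direction_to_angles(d):
    """Unit vector -> (azimuth in [0, 2*pi), polar angle in [0, pi])."""
    x, y, z = d
    azimuth = math.atan2(y, x) % (2.0 * math.pi)
    polar = math.acos(max(-1.0, min(1.0, z)))
    return azimuth, polar

def nearest_sample(d, n_azimuth, n_polar):
    """Index (i, j) of the grid cell containing direction d."""
    azimuth, polar = direction_to_angles(d)
    i = int(azimuth / (2.0 * math.pi) * n_azimuth) % n_azimuth
    j = min(int(polar / math.pi * n_polar), n_polar - 1)   # keep the south pole in range
    return i, j

# 18x9 grid: every cell spans 20 degrees in both azimuth and polar angle
print(nearest_sample((0.0, 1.0, 0.0), 18, 9))
```

A fixed angular interval oversamples the regions near the poles, which is a known property of this kind of grid.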
4.2 Iterative Optimization

Equation (12) defines a nonlinear optimization problem with a large number of 
parameters to be chosen. However, note that the VIRM parameters are only 
linearly constrained. If we fix all the shape parameters, estimating the optimal 
VIRM is just a constrained linear least squares problem. Because of this, we 
choose to optimize the shape and VIRM parameters separately and interleave 
these optimization processes, as illustrated in Fig. 3. 



[Flow chart with blocks labeled Camera Parameters, Images from multiple viewpoints, Initial Shape: Visual Hull, View Independent Reflectance Map, and 3D Shape; the legend distinguishes data flow from control flow.]

Fig. 3. Flow chart of the iterative optimization algorithm



The inputs to our algorithm are the object images taken from different view- 
points and the corresponding camera parameters. A coarse visual hull is com- 
puted from the silhouettes of the object (silhouettes can be obtained by seg- 
mentation or background subtraction) and used as the initial shape for the first 
VIRM optimization. During VIRM optimization, we fix all the shape parameters 
and find an optimal VIRM that minimizes the matching cost function in (3). 
During shape optimization, we fix the VIRM parameters and find an optimal set 
of shape parameters that minimize the matching cost function. The iteration is 
terminated when the average vertex change after shape optimization is smaller 
than a preset threshold. 
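The interleaving of the two optimizations, together with the stopping rule on the average vertex change, can be sketched as a short loop. This Python sketch is a schematic only: `alternate`, `fit_virm` and `refine_shape` are hypothetical names, and the toy callables in the usage part merely move each vertex halfway toward a fixed target so that the loop has something to converge to.

```python
import math

def alternate(vertices, fit_virm, refine_shape, threshold=1e-3, max_rounds=50):
    """Interleave VIRM fitting and shape refinement until the vertices settle."""
    virm, rounds = None, 0
    for _ in range(max_rounds):
        rounds += 1
        virm = fit_virm(vertices)                     # shape parameters fixed
        new_vertices = refine_shape(vertices, virm)   # VIRM parameters fixed
        mean_move = sum(math.dist(p, q) for p, q in zip(vertices, new_vertices)) / len(vertices)
        vertices = new_vertices
        if mean_move < threshold:                     # average vertex change is small
            break
    return vertices, virm, rounds

# Toy stand-ins for the two optimizers (illustrative only)
target = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]

def toy_fit_virm(vertices):
    return {"n_vertices": len(vertices)}

def toy_refine_shape(vertices, virm):
    return [tuple(0.5 * (a + b) for a, b in zip(p, t)) for p, t in zip(vertices, target)]

start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]   # plays the role of the visual hull
final_vertices, final_virm, rounds = alternate(start, toy_fit_virm, toy_refine_shape)
```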



4.3 VIRM and Shape Optimization 

When shape parameters are fixed, optimizing (12) to find the VIRM is equivalent
to solving a set of linear equations in the least squares sense. Each visible triangle
patch in one view gives a linear equation in $R_d(n)$ and $R_s(r)$. Because of the
discretization, we let the equations constrain the nearest samples of the VIRM. We
filter out patches that have large viewing angles (> 80 degrees in our experiments)
to avoid poor constraints being used in estimating the VIRM. The optimal
solution gives estimates of all values of $R_d(n)$ and $R_s(r)$ on the sample grid.
Some samples on the VIRM grid may not have any constraint; we obtain their
values by interpolation.
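The linear system for the VIRM samples can be assembled explicitly. The NumPy sketch below is an illustration under stated assumptions: `fit_virm_samples` is a hypothetical name, each observation is assumed to arrive with the indices of its nearest diffuse and specular samples already computed, and an unconstrained `numpy.linalg.lstsq` solve stands in for the constrained solver. Each row encodes one visible patch: its diffuse sample plus its specular sample divided by the cosine of the viewing angle equals the observed brightness.

```python
import numpy as np

def fit_virm_samples(diffuse_idx, specular_idx, n_dot_e, observed,
                     n_diffuse, n_specular, max_view_angle_deg=80.0):
    """Least-squares estimate of the sampled R_d and R_s values."""
    diffuse_idx = np.asarray(diffuse_idx)
    specular_idx = np.asarray(specular_idx)
    n_dot_e = np.asarray(n_dot_e, dtype=float)
    observed = np.asarray(observed, dtype=float)

    # discard patches seen at grazing angles
    keep = n_dot_e > np.cos(np.radians(max_view_angle_deg))
    rows = np.arange(int(keep.sum()))

    A = np.zeros((rows.size, n_diffuse + n_specular))
    A[rows, diffuse_idx[keep]] = 1.0                              # coefficient of R_d
    A[rows, n_diffuse + specular_idx[keep]] = 1.0 / n_dot_e[keep] # coefficient of R_s
    solution, *_ = np.linalg.lstsq(A, observed[keep], rcond=None)
    return solution[:n_diffuse], solution[n_diffuse:]

# Synthetic check with two diffuse and two specular samples
true_d = np.array([0.2, 0.4])
true_s = np.array([0.1, 0.3])
d_idx = np.array([0, 0, 1, 1, 0, 1, 0])
s_idx = np.array([0, 1, 0, 1, 0, 1, 1])
cosines = np.array([1.0, 0.5, 0.8, 1.0, 0.5, 0.6, 0.1])
obs = true_d[d_idx] + true_s[s_idx] / cosines
obs[-1] = 99.0   # grazing-angle outlier, removed by the 80-degree filter
est_d, est_s = fit_virm_samples(d_idx, s_idx, cosines, obs, 2, 2)
```

Samples that no row touches get a zero from the minimum-norm solution here; in the text they are filled by interpolation instead.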
Shape optimization in (12) for a fixed VIRM is a non-linear least squares
problem. Again, for the same reason, patches that are tilted away from the
camera are not used in computing $F_{matching}$ in (3). This does not create many
unconstrained patches, though, since in a multi-camera configuration every patch
must have some cameras facing toward it. We solve the optimization using the
large-scale optimization method called the Trust Region Reflective Newton (TRRN)
method [9]. In the TRRN method, the matching cost function is viewed as the squared
norm of a multi-input multi-output (MIMO) function. Every iteration of the
optimization involves the approximate solution of a large linear system using the
method of preconditioned conjugate gradients (PCG). The TRRN method requires
the Jacobian matrix of the MIMO function, which can be computed using
finite differences.
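A finite-difference Jacobian of a vector-valued residual function can be sketched as follows. This NumPy example is illustrative only: `finite_difference_jacobian` is a hypothetical name, forward differences with a fixed step are an assumption, and the toy residual stands in for the per-patch differences between synthesized and input images.

```python
import numpy as np

def finite_difference_jacobian(residuals, params, step=1e-6):
    """Forward-difference Jacobian: one extra residual evaluation per parameter."""
    params = np.asarray(params, dtype=float)
    base = np.asarray(residuals(params), dtype=float)
    jac = np.empty((base.size, params.size))
    for k in range(params.size):
        shifted = params.copy()
        shifted[k] += step
        jac[:, k] = (np.asarray(residuals(shifted), dtype=float) - base) / step
    return jac

def toy_residuals(p):
    """Three outputs, two inputs; analytic Jacobian is known for checking."""
    return np.array([p[0] ** 2 - 1.0, p[0] * p[1], np.sin(p[1])])

jac = finite_difference_jacobian(toy_residuals, [2.0, 0.5])
expected = np.array([[4.0, 0.0], [0.5, 2.0], [0.0, np.cos(0.5)]])
```

In a mesh setting each vertex influences only the patches around it, so most Jacobian entries are zero and a sparse representation would be the natural choice.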

Since each vertex on the mesh model has 3 degrees of freedom, the number
of parameters that represent the shape is 3 times the number of vertices. To
reduce the number of parameters, we impose a restriction that each vertex can 
only move along a specific direction. This direction, called the weighted average 
normal direction (WAND), is the average of the surface normal vectors over 
all the triangles sharing the vertex, weighted by the areas of these triangles. 
In addition to reducing the number of shape parameters, this restriction also 
prevents vertices from clustering together during optimization. At each iteration, 
the visibilities and WANDs of vertices are updated according to the current 
estimate of the shape. Also, the visual hull computed from silhouettes is used as 
an outer bound of the shape being estimated. 
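The weighted average normal direction can be computed directly from the mesh. The NumPy sketch below is a minimal illustration with hypothetical function names; it relies on the fact that the cross product of two triangle edges has a length equal to twice the triangle area, so summing unnormalized face normals already applies the area weighting.

```python
import numpy as np

def weighted_average_normals(vertices, triangles):
    """Area-weighted average of the face normals around every vertex."""
    vertices = np.asarray(vertices, dtype=float)
    triangles = np.asarray(triangles, dtype=int)
    a, b, c = (vertices[triangles[:, k]] for k in range(3))
    face = np.cross(b - a, c - a)          # length = 2 * triangle area
    normals = np.zeros_like(vertices)
    for k in range(3):
        np.add.at(normals, triangles[:, k], face)   # accumulate per vertex
    lengths = np.linalg.norm(normals, axis=1, keepdims=True)
    return normals / np.where(lengths > 0.0, lengths, 1.0)

def move_along_wand(vertices, wands, offsets):
    """One scalar offset per vertex replaces three free coordinates."""
    return np.asarray(vertices, dtype=float) + np.asarray(offsets, dtype=float)[:, None] * wands

square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
wands = weighted_average_normals(square, faces)
moved = move_along_wand(square, wands, [0.1, 0.1, 0.1, 0.1])
```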



4.4 Multi-scale Processing 

To avoid local minima and for computational efficiency, we use multi-scale pro- 
cessing in the optimization. We first optimize the shape parameters using a 
coarse triangular mesh and use a low sampling rate for VIRM. Then we itera- 
tively reduce the triangle size and increase the VIRM sampling rate. Triangles 
having larger gray level variations at a coarse scale are subdivided into four 
small triangles to obtain finer scale triangles. They are constructed from 3 new 
vertices which are the mid-points of three edges of the coarse triangle. 
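The one-to-four split of a triangle can be written compactly. This Python sketch handles a single triangle only and uses a hypothetical function name; sharing midpoints between neighboring triangles and the gray-level criterion for choosing which triangles to split are left out.

```python
def subdivide(vertices, triangle):
    """Split triangle (i, j, k) into four using the midpoints of its edges."""
    vertices = list(vertices)

    def midpoint(p, q):
        vertices.append(tuple(0.5 * (a + b) for a, b in zip(vertices[p], vertices[q])))
        return len(vertices) - 1   # index of the new vertex

    i, j, k = triangle
    ij, jk, ki = midpoint(i, j), midpoint(j, k), midpoint(k, i)
    # three corner triangles plus the central one, all with the original orientation
    return vertices, [(i, ij, ki), (ij, j, jk), (ki, jk, k), (ij, jk, ki)]

coarse = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
fine_vertices, fine_triangles = subdivide(coarse, (0, 1, 2))
```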

5 Experiments 

5.1 VIRM Validation 

We first perform a synthetic experiment to validate our VIRM model. A set
of 20 images of a sphere is synthesized and used as the input to the VIRM
optimization described in Section 4.3. We assume the sphere radius is known and want
to check whether the simplified VIRM model can reproduce the non-lambertian
reflectance of the sphere.

Fig. 4 shows four of the input sphere images as well as the corresponding
images rendered using the reconstructed VIRM. The sampling grid is 18x9 for the
diffuse VIRM and 32x16 for the specular VIRM. The result shows that the reconstruction
matches the originals well except for some highlights where image values are
saturated, and some areas where viewing angles are large. The average absolute
image difference between the input and reconstructed images over the entire set
is 0.017 on a scale of [0,1]. The grid representation of the reconstructed VIRM
is shown in Fig. 4(c).



Fig. 4. (a) 4 of the 20 input sphere images. (b) Sphere rendered using the estimated VIRM
(average absolute image difference over all 20 views is 0.017 on a scale of [0,1]). (c)
Estimated specular (left) and diffuse (right) components of the VIRM along a grid defined
by longitude and latitude.



5.2 Buddha Data Set (Synthetic) 

The Buddha data set is synthetic and consists of 24 views of a Buddha sculpture 
made from a single shiny material. The sculpture is illuminated by 60 directional 
light sources. Some input images are shown in Fig. 5. We run our algorithm at 
three different scales. The numbers of triangles at the three scales are around 6300,
21000, and 50000. The corresponding sampling grids are 6x3, 12x6, and 18x9 for the
diffuse VIRM and 12x6, 24x12, and 32x16 for the specular VIRM. The final reconstructed
shape is also shown in Fig. 5, compared with the ground truth shape and the initial shape.

By comparing Fig. 5(d-f) and 5(j-l), we can see that Buddha’s ears are not 
well recovered. Thin surface parts are difficult to recover since they do not cause 
enough image differences to affect the cost function. 

To obtain more quantitative measures of the performance of our algorithms,
and to separately evaluate the quality of the shape and VIRM estimates, we
compute the range images of the reconstructed shape and the images of a sphere
rendered using the estimated VIRM, for the input viewpoints as well as some novel
ones. They are compared with ground truth images. The synthesized gray scale
image (Fig. 6a), range image (Fig. 6c) and sphere image (Fig. 6e) for one of the
novel views are shown in Fig. 6. In Fig. 6(e) the specular highlights on the sphere
are not fully recovered. One reason for this is that the surface normals of the
sculpture do not cover all directions. For example, the shape does not have
many surface normals facing downward, so the VIRM estimation is not well
constrained in the corresponding direction. The low sampling rate of the VIRM, noise in
the recovered local surface orientation, and other effects such as shadows and
inter-reflection that the VIRM does not model also contribute to the reconstruction
error.

Fig. 5. (a, b, c): Three input images of the data set. (d, e, f): The ground truth 3D
model rendered with a dull material to eliminate specularities, which makes visual
evaluation of shape easier. (g, h, i): The initial 3D shape computed from silhouettes in
the input images. (j, k, l): The recovered 3D shape after optimization.

Fig. 6. (a) Synthesized gray scale image with estimated VIRM and shape. (b) Ground
truth image. (c) Range image computed from the estimated shape. (d) Range image
computed from the ground truth shape. (e) Synthesized sphere with estimated VIRM.
(f) Rendered sphere with ground truth material and lighting. All images are from a
novel viewpoint.

We evaluate the performance of our algorithms using several measures
(Fig. 7). We compute the average absolute pixel difference between ground truth
and synthesized intensity images. Average Object Image Difference (AOID) and
Average Sphere Image Difference (ASID) denote the differences for the rendered
object and sphere images, respectively. AOID reflects the quality of both the shape
and VIRM estimates, whereas ASID reflects the quality of the VIRM estimate. Ratio
of Uncovered Area (RUA) is the percentage of the non-overlapping silhouette
areas between the ground truth and synthesized objects. Pixel values in these
uncovered areas are not defined in either the synthesized image or the ground truth
image, so we do not include them in the calculation of image differences. Finally,
Average Range Image Difference (ARID) measures more directly the errors in
estimated shape by computing the average absolute object range difference between
synthesized range images from the estimated shape and those from ground truth.
In Fig. 7(b), images with high ARID values are from views that have occluding
boundaries. Since the recovered occluding boundaries are not fully aligned with
the actual boundaries, they will create large differences in the range image.
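The image-based measures can be computed from two images and their silhouette masks. The NumPy sketch below is an illustration with a hypothetical function name; in particular, normalizing the uncovered area by the union of the two silhouettes is an assumption of this example, since the normalization is not spelled out in the text. Passing intensity images gives an AOID- or ASID-style value, and passing range images gives an ARID-style value.

```python
import numpy as np

def image_metrics(synth, truth, synth_mask, truth_mask):
    """Average absolute difference over the overlap, and ratio of uncovered area."""
    overlap = synth_mask & truth_mask
    union = synth_mask | truth_mask
    # only pixels covered by both silhouettes enter the image difference
    avg_abs_diff = float(np.abs(synth[overlap] - truth[overlap]).mean())
    # uncovered area relative to the union of silhouettes (assumed normalization)
    rua = float((union & ~overlap).sum() / union.sum())
    return avg_abs_diff, rua

synth = np.array([[0.0, 0.5, 0.0], [0.0, 0.7, 0.0]])
truth = np.array([[0.0, 0.4, 0.0], [0.0, 0.9, 0.0]])
synth_mask = np.array([[1, 1, 0], [1, 1, 0]], dtype=bool)
truth_mask = np.array([[0, 1, 1], [0, 1, 1]], dtype=bool)
avg_diff, rua = image_metrics(synth, truth, synth_mask, truth_mask)
```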





Fig. 7. The various performance measures shown for different viewpoints. (a): AOID,
ASID and RUA (values normalized to [0,1]). (b): ARID (absolute value); the object's
bounding box is about 5x5x7 and the distance to the camera is 15. For both (a) and (b),
data points 1-24 are from input views, and 25-27 are from novel views.



We also synthesize an image from a novel viewpoint using the estimated VIRM
(Fig. 8a) and another image with the VIRM rotated by 60 degrees (Fig. 8c). The
images are compared with ground truth images in Fig. 8. Another object rendered
using the same VIRM is shown in Fig. 8(e).




Fig. 8. Synthesized novel view using the estimated VIRM (a), novel view with the VIRM
rotated by 60 degrees (c), to be compared with ground truth (b, d). (e) is another
object synthesized using the same VIRM as used in (a).

5.3 Van Gogh Data Set (Real)



The Van Gogh data set is courtesy of J.-Y. Bouguet and R. Grzeszczuk
(Intel). It consists of more than 300 calibrated images of a Van Gogh statue.
We select 21 images taken from different directions. These images are manually
segmented to remove the background, and the silhouettes are used to compute
the initial shape. We segment out the base of the statue since it is made of a
different material. Three of the input images are shown in Fig. 9(a-c). A
reconstruction obtained by laser scanning of the statue is also available
(Fig. 9(d-f)). The scanned shape was cleaned manually to produce a smooth
surface.

The minimization is done at two different scales. The numbers of triangles at
the two scales are around 10000 and 40000. Since the statue is made of polished
metal, which exhibits a typical metal BRDF with almost no diffuse component,
we choose a very low sampling rate for the diffuse part of the VIRM. The sampling
grids at the two scales are 6x3 and 6x3 for the diffuse VIRM, and
24x12 and 48x24 for the specular VIRM. The reconstructed shape is shown in
Fig. 9(j-l). Note that calibration errors are present in the reconstruction and
they affect both the recovered VIRM and the shape.




Fig. 9. (a, b, c): Three of the input images. (d, e, f): Shape obtained by laser scanning,
rendered with a dull material for better shape comparison. (g, h, i): Initial shape for our
algorithm, computed from silhouettes of the input images. (j, k, l): 3D shape reconstructed
by our algorithm.

Fig. 10. (a): Reconstructed gray scale image with estimated VIRM and shape from a
novel viewpoint. (b): Ground truth image from the same viewpoint. (c): Range image
computed from the estimated shape. (d): Range image obtained from a laser scan. (e):
Synthesized sphere with the estimated VIRM. (f): Synthesized Buddha with the estimated
VIRM.




Fig. 11. The various performance measures of Van Gogh data set, shown for different 
viewpoints (a): AOID and RUA (value normalized to [0, 1]). (b): ARID (absolute value); 
the bounding box of the object is about 90x80x200, distance to camera is about 950. 
For both (a) and (b), data points 1-21 are from input images, and 22-24 are from novel 
viewpoints. 



We again use AOID and ARID defined in Section 5.2 to evaluate the
performance of our algorithm. Since we do not have the lighting data for the
original data set, we cannot compute the ASID. The synthesized gray scale image
(Fig. 10a) and range image (Fig. 10c) for one novel view are shown in Fig. 10. We
also synthesize the sphere image (Fig. 10e) and Buddha image (Fig. 10f) with
the estimated VIRM. Performance measures for all viewpoints are summarized
in Fig. 11.

This data set is also used in [3]. Interested readers can compare the two
results. Our major improvements are the recovery of shape details and, since
the VIRM is estimated, a compact reflectance map that can synthesize images
of any shape from any viewpoint.

6 Conclusion and Future Work

In this paper we have proposed an algorithm to reconstruct 3D shape and the
view independent reflectance map (VIRM) from multiple calibrated images of
the object. We pose this problem as that of minimizing the difference between the
input images and the images synthesized using the estimated 3D shape and VIRM.
The VIRM is derived from the Torrance-Sparrow model, and used as a simplified model
for single material reflectance under distant lighting with no self-shadowing or
inter-reflections. An iterative method is used to minimize the matching cost
function in order to find the optimal shape and VIRM. Our algorithm does
not require the light source to be known, and it can deal with non-lambertian
reflectance. Experimental results on both synthetic and real objects show that
our algorithm is effective in recovering the 3D shape and the VIRM information.

Our ongoing and planned work includes the following. The estimated VIRM
can be used to render other objects with the same material and lighting, or to
create animations that are consistent with the original lighting. Alternatively,
the material/lighting of the synthesized image can be changed by directly
modifying the VIRM. Other directions include taking into account the effects of
shadowing and inter-reflection and allowing objects with multiple materials.